RE: [MarkLogic Dev General] RE: Fragmentation planning

Lee, David Sun, 20 Dec 2009 15:27:44 -0800

I was looking at the Registered Queries and they confuse me a bit.
What is the lifetime of a Registered Query? Can I put the IDs in the
DB itself and save them for a long time, or are they lost when the
session or server ends?



-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Kelly
Stirman
Sent: Sunday, December 20, 2009 6:18 PM
To: [email protected]
Subject: [MarkLogic Dev General] RE: Fragmentation planning

Some thoughts that may be useful in designing how you organize documents
in your database:

-Generally speaking, the more restrictive your query, the more
efficiently it can be resolved by the server. The time it takes to
resolve your query is usually proportional to the number of results
rather than to the complexity of the query. So, a complex
query with many constraints that is highly restrictive and returns few
results can be resolved more quickly than a very simple query with few
constraints that returns many results. Partitions can be used to make
your queries more restrictive, and while they may make your queries more
complex, in many cases they can improve the performance of your
application.

-There exist both physical and logical partitions in MarkLogic. 

-Forests are physical partitions. Because queries are evaluated in
forests in parallel, it is normally best to use the default
configuration of the server which spreads documents across forests. This
allows MarkLogic to "divide and conquer" the work associated with a
query. Of course, you need sufficient hardware to accommodate the
parallel work, which is why we typically recommend one forest for every
pair of CPU cores. (There are some applications for which designing your
own policies around document placement is the right approach, but that
should be covered in a separate thread.)

-There are many forms of logical partitions. Directories and collections
are good examples. They are both very fast for queries and for delete
operations. There's no reason not to combine them in your design.
Collections are very cheap, so you might consider using several with any
document. 

-XML can be another good way to partition your database, as you have
probably found. Using simple structures that are suitable for
element-value-query or element-attribute-value-query is one of the best
ways to partition with XML.

-Document properties are a good way to partition your database. Joins
between the document fragment and its property fragment are optimized
for simple properties when using cts:properties-query(). This allows you
to use XML for partitioning when you cannot control the schema for your
documents, or if you're dealing with binary or text documents.

-Security is another way to partition your database. Ultimately, security
metadata is part of the indexes in a way that is similar to collections.

-Registered queries are a remarkably powerful way to partition your
database. Registered queries allow you to define a partition based on
any cts:query. A registered query is similar to a materialized view,
except in this case the materialization only happens in the indexes.
Take any complex unfiltered query, register it, and after you pay the
costs of running the query the first time, the next time it will be as
fast as a simple element-value-query. Plus, the registered query works
with updates.
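
A minimal sketch of how the logical partitions above might be combined
and then registered. The collection URI, directory, element name, and
property name are hypothetical and not from this thread; only the cts:
functions themselves are standard MarkLogic built-ins.

```xquery
xquery version "1.0-ml";

(: Hypothetical partition: every name and URI below is an assumption. :)
declare variable $partition-query as cts:query :=
  cts:and-query((
    (: logical partition by collection :)
    cts:collection-query("lab-results"),
    (: logical partition by directory, including subdirectories :)
    cts:directory-query("/patient/", "infinity"),
    (: partition by a simple property on the document's property fragment :)
    cts:properties-query(
      cts:element-value-query(xs:QName("review-status"), "approved"))
  ));

(: Register the query once; cts:register returns an xs:unsignedLong ID. :)
let $id := cts:register($partition-query)
(: Registered queries are used unfiltered, so results come from the indexes. :)
return
  cts:search(fn:doc(), cts:registered-query($id, "unfiltered"))
```

The first search pays the cost of resolving the full query; later
searches that pass the same ID to cts:registered-query() can reuse the
registered result. Whether a stored ID stays valid over time is not
settled in this thread, so treat any saved ID as something that may need
re-registering.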

I hope some of this helps in your efforts to organize your database.

Kelly
 
Message: 2
Date: Sat, 19 Dec 2009 16:41:29 -0800
From: "Lee, David" <[email protected]>
Subject: RE: [MarkLogic Dev General] RE: Fragmentation planning
To: "General Mark Logic Developer Discussion"
        <[email protected]>
Message-ID: <dd37f70d78609d4e9587d473fc61e0a714ccc...@postoffice>
Content-Type: text/plain;       charset="iso-8859-1"

First off, the disclaimer that I'm not a MarkLogic expert, I'm just
learning myself, so I welcome anyone who knows more to disagree with me.

That said, I don't believe queries will be slower or faster based
on what directory structure you use.  cts:search() seems to me to
perform equally well regardless of how the directory structure is set up.
There are many ways of using it and I'm only beginning to scratch the
surface.
But examples of what I believe will be equally quick to search are
        cts:search( xdmp:directory( ..) , ... )    -- your original idea
        cts:search( //element , ... )              -- search based on an element name, regardless of the URIs
        cts:search( fn:collection(...) , ... )     -- limit based on a collection

What seems interesting to me, and I'm just barely getting a handle on it,
is that you can "refactor" your searches in many different ways with
consistent performance characteristics.
Example
        cts:search( //p , cts:and-query(( cts:directory-query("dir") , cts:word-query("word") )))

In my tests this performs as well as something like
        cts:search( xdmp:directory("dir" ) , cts:element-value-query( ... ))

So I suggest that the presumption that organizing things in directories
benefits search speed is mistaken.   It has *other* benefits, but
searching seems to work well across the board regardless of what URI
you assign to documents.   It's really amazing actually.


As for the benefits of a RESTful style for organizing the directory
tree, based on patient as the root, the main benefit I suggest is that
it becomes an easy mapping for a web service, if your primary types of
queries are about a particular patient.  A client using a RESTful
approach can (with some help from URI rewriting rules in the App module)
have what seems a "natural" view on patient data
        /patient/patient_id/     -- maybe combine ALL the subdirectory documents into 1
        /patient/patient_id/lab_tests -- all lab tests
        /patient/patient_id/lab_tests/test_123  - a single lab test 

etc
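
As a rough sketch of the URI rewriting idea, a rewriter module on an
HTTP app server could map the patient-oriented paths onto handler
modules. The handler names (/patient.xqy, /lab-tests.xqy, /lab-test.xqy)
and their parameters are hypothetical, and the patterns assume the
request URL carries no query string.

```xquery
xquery version "1.0-ml";

(: The rewriter returns the internal URL the server should evaluate.
   All handler module names below are assumptions for illustration. :)
let $url := xdmp:get-request-url()
return
  if (fn:matches($url, "^/patient/[^/]+/lab_tests/[^/]+/?$")) then
    (: a single lab test :)
    fn:replace($url, "^/patient/([^/]+)/lab_tests/([^/]+)/?$",
               "/lab-test.xqy?patient=$1&amp;test=$2")
  else if (fn:matches($url, "^/patient/[^/]+/lab_tests/?$")) then
    (: all lab tests for one patient :)
    fn:replace($url, "^/patient/([^/]+)/lab_tests/?$",
               "/lab-tests.xqy?patient=$1")
  else if (fn:matches($url, "^/patient/[^/]+/?$")) then
    (: everything about one patient :)
    fn:replace($url, "^/patient/([^/]+)/?$", "/patient.xqy?patient=$1")
  else
    (: anything else passes through unchanged :)
    $url
```

Each handler can then build the matching directory URI from its
parameters, so the client-facing paths and the database directory
layout stay aligned.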

It also helps if you map this tree to a WebDAV view ... files are easier
to navigate from a simple file explorer.

Moving, copying, updating, adding or deleting data becomes a directory
operation without having to know anything at all about the structure
(contents,elements etc) of the decomposed files.
The directory structure can be used explicitly to navigate and
manipulate data associated with a patient with no knowledge *at all*
about the contents of files. 
This can become extremely convenient when you toss non-XML files into
the mix, such as lab XRay images (jpg, gif) ... You can of course
assign XML properties to non-XML files, but if you simply put them in a
patient-oriented directory structure, life is simplified, for example
        /patient/patient_id/lab_tests/test_123/images/
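
A small sketch of such a content-agnostic directory operation: listing
every document under one patient's directory, whatever its format. The
directory URI is the placeholder used above, not a real path.

```xquery
xquery version "1.0-ml";

(: Placeholder directory URI taken from the layout above. :)
declare variable $patient-dir as xs:string := "/patient/patient_id/";

(: "infinity" descends into lab_tests/, images/ and any other subdirectory.
   Nothing here inspects element names or document contents. :)
for $doc in xdmp:directory($patient-dir, "infinity")
order by xdmp:node-uri($doc)
return xdmp:node-uri($doc)
```

The same directory URI can be handed to xdmp:directory-delete() to
remove a patient's documents in one call, again without knowing what
the files contain.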


So in conclusion, I suggest you not worry about the efficiency of
searching when deciding on your directory or URI structure, and instead
choose a structure that has advantages based on organizing your data.
Searching works great regardless of your directory structure.
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general