RE: [MarkLogic Dev General] RE: Fragmentation planning

Lee, David Sat, 19 Dec 2009 16:41:36 -0800

First off, a disclaimer: I'm not a MarkLogic expert, I'm just learning myself, 
so I welcome anyone who knows more to disagree with me.

That said, I don't believe queries will be slower or faster based on what 
directory structure you use.  cts:search() seems to me to perform equally well 
regardless of how the directory structure is set up.  There are many ways of 
using it and I'm only beginning to scratch the surface.
But examples of what I believe will be equally quick to search are
        cts:search( xdmp:directory( ..) , ... )    -- your original idea 
        cts:search( //element , ... )              -- search based on an element name, regardless of the URIs
        cts:search( fn:collection(...) , ... )     -- limit based on a collection

What seems interesting to me, and I'm just barely getting a handle on it, is 
that you can 'refactor' your searches in many different ways with consistent 
performance characteristics.
Example
        cts:search( //p , cts:and-query(( cts:directory-query("dir") , cts:word-query("word") )) )

In my tests this performs as well as something like
        cts:search( xdmp:directory("dir" ) , cts:element-value-query(  ... )) 
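
To make the comparison concrete, here is a minimal sketch of the two styles written out in full. The directory "/dir/" and the word "word" are placeholders, and I use a plain word query in both forms (not the element-value-query above) so the two can be compared side by side. Treat it as an illustration, not a benchmark.

```xquery
xquery version "1.0-ml";

(: Form 1: scope by element name, and restrict to a directory
   inside the query itself. "/dir/" and "word" are placeholders. :)
let $by-query :=
  cts:search(//p,
    cts:and-query((
      cts:directory-query("/dir/", "infinity"),
      cts:word-query("word"))))

(: Form 2: scope by directory in the first argument, so the
   results are whole documents rather than p elements. :)
let $by-directory :=
  cts:search(xdmp:directory("/dir/", "infinity"),
    cts:word-query("word"))

return (fn:count($by-query), fn:count($by-directory))
```

The two counts need not match, since the first form returns p elements and the second returns documents. The point is only that the directory constraint can live in either argument.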

So I suggest that the presumption that organizing things in directories has 
any benefit for search speed is mistaken.   It has *other* benefits, but 
searching seems to work well across the board regardless of what URI you 
assign to documents.   It's really amazing actually.

As for the benefits of a RESTful style for organizing the directory tree, based 
on patient as the root, the main benefit I suggest is that it becomes an easy 
mapping for a web service, if your primary types of queries are about a 
particular patient.  A client using a RESTful approach can (with some help 
from URI rewriting rules in the App module) have what seems a "natural" 
view on patient data
        /patient/patient_id/     -- maybe combine ALL the sub directory documents into 1 
        /patient/patient_id/lab_tests -- all lab tests
        /patient/patient_id/lab_tests/test_123  -- a single lab test 

etc

It also helps if you map this tree to a WebDAV view ... files are easier to 
navigate from a simple file explorer.

Moving, copying, updating, adding or deleting data becomes a directory 
operation without having to know anything at all about the structure 
(contents,elements etc) of the decomposed files.
The directory structure can be used explicitly to navigate and manipulate data 
associated with a patient with no knowledge *at all* about the contents of 
files. 
This can become extremely convenient when you toss non-XML files into the mix, 
such as lab X-ray images (jpg, gif) ... You can of course assign XML 
properties to non-XML files, but if you simply put them in a patient-oriented 
directory structure life is simplified, like
        /patient/patient_id/lab_tests/test_123/images/
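
As a rough sketch of what "no knowledge of the contents" means in practice, the following lists every document stored under one patient's directory, XML or binary alike, using nothing but the URI prefix. The path "/patient/patient_id/" is the same placeholder as above, not a real URI.

```xquery
xquery version "1.0-ml";

(: Placeholder directory for a single patient. :)
let $patient-dir := "/patient/patient_id/"

(: "infinity" walks all sub-directories (lab_tests, images, ...).
   Nothing here inspects element names or document types. :)
for $doc in xdmp:directory($patient-dir, "infinity")
return xdmp:node-uri($doc)
```

The same URI prefix could be handed to xdmp:directory-delete() to remove a patient's data in one call, which is the kind of directory-level operation I mean.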

So in conclusion, I suggest you not worry about the efficiency of searching 
when deciding on your directory or URI structure, and instead choose a 
structure that has advantages based on organizing your data.  Searching works 
great regardless of your directory structure.
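
If you still want a cheap way to search one type of data across all patients, collections can cover that without changing the directory layout. A minimal sketch, assuming a collection named "lab-results" and a made-up document (the URI, the collection name and the content are all hypothetical):

```xquery
xquery version "1.0-ml";

(: Store the document under the patient-oriented URI and also
   tag it with a collection that groups it by data type. :)
xdmp:document-insert(
  "/patients/10291004/lab-results/test_123.xml",
  <lab-result><note>placeholder</note></lab-result>,
  xdmp:default-permissions(),
  "lab-results")
;

xquery version "1.0-ml";

(: Separate statement, so it sees the insert above: search all
   lab results for every patient, regardless of URI. :)
cts:search(fn:collection("lab-results"), cts:word-query("placeholder"))
```

The directory says who the data belongs to and the collection says what kind of data it is, so neither has to be bent to serve the other.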

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Karl Erisman
Sent: Saturday, December 19, 2009 6:07 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] RE: Fragmentation planning

Well, since I'm using option (1), the issue is no longer of immediate
concern (option (1) involves single documents containing all types of
patient data, not separate documents for each type).  It would become
a concern, however, if performance proves to be inadequate and option
(3) is used to address the issue.

In considering option (3), I was originally thinking of the directory
structure you describe, but I thought that organizing things based on
type of data might make it faster to search across all patients for
specific data by using cts:directory-query() to limit the scope of the
search to the directory storing a particular type of data.  But I see
your point about using a directory structure that, in a RESTful sense,
models the patient as the primary resource.  I suppose then that the
XML structure of the individual documents would be the facility used
to narrow the scope of the search (i.e.
cts:element-query(xs:QName("demographics"), <rest of the query>) assuming that
the demographic data documents have root node <demographics>).  I
would expect that to be slower, though (slower than using a
directory-query).

When you say that a structure reflecting the view of patients as the
primary resource has many benefits, are you thinking in terms of
ability to expose the data as a RESTful service?  What other benefits
are you thinking of?

Thanks,
Karl

On Sat, Dec 19, 2009 at 9:13 AM, Lee, David <[email protected]> wrote:
> I attended a workshop at Balisage 2009 where a developer was modeling very 
> similar data,
> HL7 based patient information.   In his case I don't think he was using 
> MarkLogic,  but the structure
> of the data and the rationale I think bear consideration.
> His design used directories for patient data but inverted from your structure.
> This design is more "RESTful" in the sense that the directory structure 
> itself models an aggregate model based around the patient, not the part (lab 
> test, info etc).  And the URIs follow a left-to-right decomposition of 
> document from container to contained.
>
> To jump to the conclusion, I would suggest a structure
> not like your suggestion
> /demographics/10291004
> /lab-results/10291004
>
> but rather
>
> /patients/10291004/lab-results/...
> /patients/10291004/demographics/...
>
> You can use Collections if you wish to group all 'lab-results' across all 
> patients
> but the primary directory structure is related to the patient,  which in a 
> patient data model is the
> primary (top level) object and having the directory structure directly 
> reflect that has many benefits.
>
>
> ----------------------------------------
> David A. Lee
> Senior Principal Software Engineer
> Epocrates, Inc.
> [email protected]
> 812-482-5224
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
