Re: [MarkLogic Dev General] Processing Large Documents?

Damon Feldman Mon, 20 Feb 2012 07:26:36 -0800

Todd,

There's a deeper purpose as well. In document-oriented programming, documents 
generally correspond to natural items in the business domain, or to actual 
documents (PDFs, web forms) that were input to the system. Fragment size is 
not the issue per se; it's more about the programming model and the natural 
grouping of data.


E.g. in a customer database, you'd likely store individual customers as 
documents, since you'll search, load and store them as conceptual units. It's 
then easy to query for all customers in state="OH" whose status is "pending", 
since the indexes tell you which documents match these criteria. If you broke 
the address out into a separate, smaller document, this becomes harder again, 
because the state and the status would then live in different documents, so 
smaller is not always better. 10 KB to 200 KB per document is common.

To support this, MarkLogic indexes documents (by keyword, values and structure) 
and the optimized unit of read/write is the document. This makes accessing 
documents (customers, in the example above) more natural and faster.
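
As a rough sketch of the customer example above: this is plain Python, not 
MarkLogic code, and the URIs, field names and in-memory index are all 
assumptions made up for illustration. It only shows the idea that an index 
maps each value to the whole documents that contain it, so a query on two 
criteria is an intersection of two sets of documents.

```python
from collections import defaultdict

# Hypothetical customer documents; each one is stored and retrieved whole.
customers = {
    "/customers/1.xml": {"name": "Ann", "state": "OH", "status": "pending"},
    "/customers/2.xml": {"name": "Bob", "state": "OH", "status": "active"},
    "/customers/3.xml": {"name": "Cy", "state": "NY", "status": "pending"},
}


def build_index(docs):
    """Map each (field, value) pair to the set of document URIs containing it."""
    index = defaultdict(set)
    for uri, doc in docs.items():
        for field, value in doc.items():
            index[(field, value)].add(uri)
    return index


def find(index, **criteria):
    """Return the URIs of documents that match every field=value criterion."""
    matches = [index.get((field, value), set()) for field, value in criteria.items()]
    return sorted(set.intersection(*matches)) if matches else []


index = build_index(customers)
print(find(index, state="OH", status="pending"))
```

If the address were split out into its own document, the "state" and "status" 
entries would point at different URIs, the intersection would come back empty, 
and you would need an extra join step to relate the two.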

Yours,
Damon

From: [email protected] 
[mailto:[email protected]] On Behalf Of Todd Gochenour
Sent: Monday, February 20, 2012 1:57 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Documents?

This advice repeats a recommendation I saw earlier tonight during some of my 
research, namely that with MarkLogic it's better to break up documents into 
smaller fragments.  I guess there's a performance gain in bursting a document 
into small fragments, something to do with concurrency and locking or 
minimizing the depth of the hierarchy, perhaps?

Note that my document doesn't equate to tables but instead it equates to the 
entire database, which is two levels away from this recommendation to have 
documents equate to rows.  It seems like the conventional wisdom is to burst 
large documents into smaller fragments so that each fragment can be handled 
independently.  I've always felt it simpler and more accurate to load and use 
the XML file as is and not shred it into multiple parts.  I want to replace the 
MySQL database with an XML database for this very reason.

So I've managed to load this large document into the database, and I've done 
my first transformation of it, using XQuery to perform the extraction, and 
performance seems rather impressive. I've done the same thing with both 
eXistDB and xDB with no problem, indexing everything including the deep 
hierarchical structure. Once in the database, I should be able to update 
fragments within the document as easily as if these fragments were burst into 
individual files. Is there a technical reason (which I've yet to discover) why 
this would not be the case?

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
