Todd,

There's a deeper purpose as well. In document-oriented programming, documents generally correspond to natural items in the business domain, or to actual documents (PDFs, web forms) that were input to the system. Fragment size is not the issue per se; it's more about the programming model and the natural grouping of data.
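As a rough illustration of that programming model, here is a toy sketch in Python. It is not MarkLogic's implementation or API: the `documents` data, the URIs, and the `flatten`, `build_index` and `search` helpers are all hypothetical names made up for this example. Each customer is kept as one document, and an index maps (path, value) pairs to the documents that contain them, so a query on two criteria becomes a set intersection.

```python
from collections import defaultdict

# Hypothetical customer documents; each dict stands in for one stored document.
documents = {
    "/customers/1.xml": {"name": "Ann", "status": "pending", "address": {"state": "OH"}},
    "/customers/2.xml": {"name": "Bob", "status": "active", "address": {"state": "OH"}},
    "/customers/3.xml": {"name": "Cy", "status": "pending", "address": {"state": "NY"}},
}


def flatten(doc, prefix=""):
    """Yield (path, value) pairs for every leaf value in a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def build_index(docs):
    """Map each (path, value) pair to the set of document URIs containing it."""
    index = defaultdict(set)
    for uri, doc in docs.items():
        for path, value in flatten(doc):
            index[(path, value)].add(uri)
    return index


def search(index, criteria):
    """Return the URIs of documents that match every (path, value) criterion."""
    matches = [index.get(pair, set()) for pair in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_index(documents)

# Both criteria resolve to whole customer documents, so combining them
# is a plain intersection of two sets of URIs.
pending_in_ohio = search(index, {"/address/state": "OH", "/status": "pending"})
print(sorted(pending_in_ohio))
```

If the address were stored as its own document, the state and the status would be indexed under different URIs, and the intersection would no longer identify a customer without an extra join step.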
E.g. in a customer database, you'd likely store individual customers as documents, since you'll search, load and store them as conceptual units. It's then easy to query for all customers in state="OH" whose status is "pending", since the indexes tell which documents match these criteria. If you broke the address out into a separate, smaller document, this becomes harder again, so smaller is not always better. 10 KB to 200 KB per document is common.

To support this, MarkLogic indexes documents (by keyword, values and structure), and the optimized unit of read/write is the document. This makes accessing documents (customers in the example above) more natural and faster.

Yours,
Damon

From: [email protected] [mailto:[email protected]] On Behalf Of Todd Gochenour
Sent: Monday, February 20, 2012 1:57 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Documents?

This advice repeats a recommendation I saw earlier tonight during some of my research, namely that with MarkLogic it's better to break up documents into smaller fragments. I guess there's a performance gain in bursting a document into small fragments, something to do with concurrency and locking or minimizing the depth of the hierarchy, perhaps?

Note that my document doesn't equate to tables; it equates to the entire database, which is two levels away from this recommendation to have documents equate to rows. It seems the conventional wisdom is to burst large documents into smaller fragments so that each fragment can be handled independently. I've always felt it simpler and more accurate to load and use the XML file as is and not shred it into multiple parts. I want to replace the MySQL database with an XML database for this very reason.

So I've managed to load this large document into the database, and I've done my first transformation of it using XQuery to perform the extraction, and performance seems rather impressive.
I've done the same thing with both eXistDB and xDB with no problem, indexing everything including the deep hierarchical structure. Once in the database, I should be able to update fragments within the document as easily as if these fragments were burst into individual files. Is there a technical reason (I've yet to discover) for why this would not be the case?
_______________________________________________ General mailing list [email protected] http://developer.marklogic.com/mailman/listinfo/general
