> On Jan 18, 2015, at 9:14 PM, Michael Blakeley <[email protected]> wrote:

Thanks for the reply.

> Adding fragment rules makes sense if and only if you have large documents
> with a number of elements that form conceptually equivalent sub-documents.
> This works when the document acts something like a table, and for whatever
> reason you don't want to split it on ingestion. So you create virtual
> sub-documents: not as good as true documents, but good enough -- and ideal
> for certain situations. From what I understand you aren't in any of those
> situations. Each of your documents is large, but there's no conceptually
> useful sub-document structure.
>
> All is not lost: MarkLogic should still be able to do the job. I've worked
> with a database over 7 TB in size with a significant number of large
> documents, some well above 50 MB.
>
> In a situation like that you have to be careful with your queries.
> Unfiltered search and lexicon accessors don't much care how large your
> documents are: use them wherever possible. Avoid returning large result
> sets: if that means you have to cap the page size for search results, do
> it. You might be able to arrange things so that you can display search
> results and other query reports entirely from some mix of range indexes
> and properties, without touching the documents themselves.
>
> Maybe you could write up one of your "can't really do anything" use cases,
> and ask us how to solve it? You might get some useful ideas, and you could
> repeat that with other use cases until you feel comfortable with the
> techniques.

Sure. I started off just creating a new flow and trying to load the
documents into the Documents database with all the default settings. It
loaded some documents but blew up with XDMP-FRAGTOOLARGE errors. I could
mitigate this somewhat by lowering the transaction size from 50 documents
to 10; that got quite a few more documents loaded, but the job still
eventually failed. I also raised the "in memory tree size" on the Documents
database from its default of 32 MB to 64 MB, but didn't notice that it made
any difference. With about 90% of my documents loaded, I decided to give up
on loading the rest for the moment and see whether I could query anything.

A basic search:search query on a word that I knew occurred only a handful
of times in the corpus worked, but a query on a slightly less rare word
failed with "expanded tree cache full." (A capped, snippet-free version is
sketched at the end of this message.)

I next built my own query that I knew would return only one result. It
selects the title out of the header of a single document:

-----
xquery version "1.0-ml";

declare namespace tei="http://www.tei-c.org/ns/1.0";

for $doc in doc("/content/mldocs/A09134.xml")
return $doc/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title
-----

That worked, but changing it to return one result for each document in the
collection (about 500 results):

-----
for $doc in collection("/tickets/ticket/16669535610111738813")
return $doc/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title
-----

once again failed with an "expanded tree cache full" error. (A range-index
alternative is sketched at the end of this message.)

Only after adding a fragment root could I even load all the documents into
the database or run a simple query that returns one result per document.
But I still run into tree cache full errors with some regularity. I can
imagine taking steps to limit the number of results, but the limit would
need to be something like 10,000, not 10 or 100.

I was hoping regularly-sized fragments might be the key to predicting when
simple queries are going to step off a cliff into undefined behavior. If
that's not the key, I don't know what is.
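
To make the advice concrete, here are two untested sketches of what I take
it to mean. First, a capped, snippet-free search, on the assumption that
snippet generation is what drags whole documents into the expanded tree
cache. The query word and page length are placeholders:

-----
xquery version "1.0-ml";

import module namespace search =
    "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

(: Return the first page of at most 10 results with empty snippets,
   so no matching document has to be expanded just to build the
   result list. :)
search:search(
  "rareword",
  <options xmlns="http://marklogic.com/appservices/search">
    <transform-results apply="empty-snippet"/>
  </options>,
  1,   (: start :)
  10   (: page length :)
)
-----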
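
Second, applying the range-index suggestion to the title listing: assuming
a string element range index on tei:title is added to the database, the
titles can come straight out of the lexicon without ever touching the
documents:

-----
xquery version "1.0-ml";

declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Read the title values directly from the range index. The
   documents themselves are never expanded; values come back
   deduplicated, in collation order. :)
cts:element-values(
  xs:QName("tei:title"),
  (),
  (),
  cts:collection-query("/tickets/ticket/16669535610111738813")
)
-----

One caveat: an element range index on tei:title picks up every tei:title in
a document, not just the one under titleStmt, so this is only an
approximation of the FLWOR above; a path range index on the exact titleStmt
path would narrow it.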
________________________________________
Craig A. Berry
mailto:[email protected]

"... getting out of a sonnet is much more
difficult than getting in."
                           -- Brad Leithauser

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
