> On Jan 18, 2015, at 9:14 PM, Michael Blakeley <[email protected]> wrote:

Thanks for the reply.

> Adding fragment rules makes sense if and only if you have large documents 
> with a number of elements that form conceptually equivalent sub-documents. 
> This works when the document acts something like a table, and for whatever 
> reason you don't want to split it on ingestion. So you create virtual 
> sub-documents: not as good as true documents, but good enough — and ideal for 
> certain situations. From what I understand you aren't in any of those 
> situations. Each of your documents is large, but there's no conceptually 
> useful sub-document structure. 
> 
> All is not lost: MarkLogic should still be able to do the job. I've worked 
> with a database over 7 TB in size with a significant number of large 
> documents, some well above 50 MB.
> 
> In a situation like that you have to be careful with your queries. Unfiltered 
> search and lexicon accessors don't much care how large your documents are: 
> use them wherever possible. Avoid returning large result sets: if that means 
> you have to cap the page size for search results, do it. You might be able to 
> arrange things so that you can display search results and other query reports 
> entirely from some mix of range indexes and properties, without touching the 
> documents themselves.
> 
> Maybe you could write up one of your "can't really do anything" use cases, and 
> ask us how to solve it? You might get some useful ideas, and you could repeat 
> that with other use cases until you feel comfortable with the techniques.

Sure.  I started off just creating a new flow and trying to load the documents 
into the Documents database with all the default settings. It loaded some 
documents but blew up with XDMP-FRAGTOOLARGE errors.  I could mitigate this 
somewhat by lowering the transaction size from 50 documents to 10; that got 
quite a few more documents in, but it still failed eventually.  I also raised 
the "in-memory tree size" on the Documents database from its default of 32 MB 
to 64 MB but didn't notice that it made any difference.
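
In case it matters, the in-memory tree size can be bumped through the Admin API 
as well as the Admin UI; something like this sketch (64 MB simply being the 
value I tried):

-----
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Raise the in-memory tree size on the Documents database to 64 MB. :)
let $config := admin:get-configuration()
let $db     := xdmp:database("Documents")
let $config := admin:database-set-in-memory-tree-size($config, $db, 64)
return admin:save-configuration($config)
-----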

About 90% of my documents did load, so I decided to give up on loading the rest 
for the moment and see whether I could query anything.  A basic search:search 
query on a word that I knew occurred only a handful of times in the corpus 
worked, but a query for a slightly less rare word failed with "expanded tree 
cache full."

I next built my own query that I knew would return only one result.  This 
selects the title out of the header of a single document:

-----
xquery version "1.0-ml";
declare namespace tei="http://www.tei-c.org/ns/1.0";

for $doc in doc("/content/mldocs/A09134.xml")
return $doc/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title
-----

That worked, but changing it to return one result for each document in the 
collection (about 500 results):

-----
xquery version "1.0-ml";
declare namespace tei="http://www.tei-c.org/ns/1.0";
for $doc in collection("/tickets/ticket/16669535610111738813")
return $doc/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title
-----

once again failed with an "expanded tree cache full" error.
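
I can see how the lexicon approach Michael mentions would sidestep the documents 
entirely; I imagine it would look something like this, though it assumes a 
string range index on tei:title that I haven't set up:

-----
xquery version "1.0-ml";
declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Read titles straight from a range index instead of loading each document.
   Assumes an element range index on tei:title, which this database lacks. :)
cts:element-values(
  xs:QName("tei:title"),
  (),
  (),
  cts:collection-query("/tickets/ticket/16669535610111738813"))
-----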

Only after adding a fragment root could I even load all the documents into the 
database or run a simple query that returns one result per document.  But I 
still run into tree cache full errors with some regularity.  I can imagine 
taking steps to limit the number of results, but the limit would need to be 
something like 10,000, not 10 or 100.  I was hoping regularly-sized fragments 
might be the key to predicting when simple queries are going to step off a 
cliff into undefined behavior.  If that's not the key, I don't know what is.
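
For completeness, adding a fragment root can also be scripted through the Admin 
API, roughly as below (a sketch; tei:div is only an illustration here, not 
necessarily the element I settled on):

-----
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Add a fragment root to the Documents database.
   tei:div is illustrative; substitute the element actually fragmented on. :)
let $config := admin:get-configuration()
let $db     := xdmp:database("Documents")
let $root   := admin:database-fragment-root("http://www.tei-c.org/ns/1.0", "div")
let $config := admin:database-add-fragment-root($config, $db, $root)
return admin:save-configuration($config)
-----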

________________________________________
Craig A. Berry
mailto:[email protected]

"... getting out of a sonnet is much more
 difficult than getting in."
                 Brad Leithauser
