I'd suggest asking about one problem at a time, but here are some thoughts.

With large documents I wouldn't use Info Studio. That may be controversial, but 
my experience is that mlcp or RecordLoader would be more efficient. You might 
still run into XDMP-FRAGTOOLARGE with any batch size above 1. With large 
documents I don't think batching will help much anyway, so I wouldn't bother. 
At size=1, any XDMP-FRAGTOOLARGE messages should tell you which buffer to 
increase. Read the full error message, and post a copy if you have questions 
about what it's telling you.
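
For example, an mlcp import at size=1 might look something like this. The host,
port (an XDBC server in this sketch), credentials, and paths are all
placeholders for your environment:

    mlcp.sh import -host localhost -port 8040 \
        -username admin -password admin \
        -input_file_path /data/tei -document_type xml \
        -batch_size 1 -transaction_size 1

With one document per transaction, any XDMP-FRAGTOOLARGE error points at
exactly one input file.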

Once the documents are loaded, it's time to write queries. As you work on 
queries you may run into XDMP-EXPNTREECACHEFULL. That means the server doesn't 
have enough memory for the working set, so the query can't complete. You can 
fix that in two ways: reduce the size of the working set by tuning the query, 
or increase the expanded tree cache size in the group configuration. It's 
usually better to tune the query, but tuning the configuration is also a valid 
solution in some situations.

When might you tune the expanded tree cache size? Basically when it's the only 
remaining option: see below for other ideas to try first. Figure the expanded 
tree cache size will need to be about 3x the XML byte size of the working set. 
So if you need to extract titles from 1,000 documents averaging 8-MiB each, and 
you can't stream or use any of the other strategies below, you might need 
around 24-GiB. Note that the expanded tree cache size is limited to 32-GiB per 
host, placing a cap on this approach. Also keep in mind that the host still 
needs room for other stuff in memory. As a rule of thumb I tune the total group 
cache sizes no larger than 1/3 to 1/2 of total RAM. Because of these factors 
it's often better to use query tuning strategies.
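
That said, if raising the cache does turn out to be the answer, you can script
it with the Admin API. A minimal sketch, assuming the Default group; the size
is in MB:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
        at "/MarkLogic/admin.xqy";

    let $config := admin:get-configuration()
    let $group := admin:group-get-id($config, "Default")
    (: 24576 MB = 24-GiB; size this for your own hosts :)
    return admin:save-configuration(
        admin:group-set-expanded-tree-cache-size($config, $group, 24576))

Expect the group's hosts to restart when cache sizes change.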

To start tuning the query, use predicate limits and avoid FLWOR expressions.
If possible, write your queries so that they can stream, minimizing the working
memory used. There are some tips at
http://stackoverflow.com/questions/14679746/avoiding-xdmp-expntreecachefull-and-loading-document
and http://blakeley.com/blogofile/2012/03/19/let-free-style-and-streaming/

For example, your title query could be rewritten as:

    (collection()/tei:TEI/tei:teiHeader/
       tei:fileDesc/tei:titleStmt/tei:title)[1 to 10]

This omits the FLWOR, which wasn't doing anything and tends to break streaming.
Note that developer tools also tend to break streaming, so you may have to
develop the more sensitive queries with direct HTTP requests rather than cq,
Query Console, or Eclipse. I'd avoid the REST API too, at least for now. Stick
to simple .xqy files served directly by an HTTPServer.
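
For example, a minimal titles.xqy under the HTTPServer root, fetched with a
plain GET of something like http://localhost:8011/titles.xqy (the port and
filename are whatever you configure), could hold the query above verbatim:

    xquery version "1.0-ml";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    (collection()/tei:TEI/tei:teiHeader/
       tei:fileDesc/tei:titleStmt/tei:title)[1 to 10]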

Another trick is to avoid loading the entire document by reading directly from 
range indexes. These are basically column indexes: create an element range 
index on 'tei:title', and all the values of that element go into a sorted value 
index. Then you can retrieve all titles without going through the expanded tree 
cache:

    cts:element-values(xs:QName('tei:title'))

You can intersect that with a query and do other tricks: see 
https://docs.marklogic.com/guide/search-dev/lexicon for more on that. Note that 
adding new element-range indexes will trigger reindexing, which may hit 
XDMP-FRAGTOOLARGE if that hasn't been tuned. I don't think the reindexer batch 
size can be tuned, so it may be easier to reingest.
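
For example, to pull titles for just the documents in your test collection,
something like this stays entirely in the lexicon. The collection URI is the
one from your earlier query:

    xquery version "1.0-ml";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    cts:element-values(
        xs:QName("tei:title"), (), (),
        cts:collection-query("/tickets/ticket/16669535610111738813"))

One caveat: the index holds every tei:title in each document, not just the one
in the teiHeader, so expect extra values unless the element only appears there.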

Another trick is to copy some element values into properties at ingestion time. 
Properties are stored in their own fragments, one per document. So if you keep 
the properties small, those fragments will be small. With RecordLoader you 
could use 
CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.http.HttpContentFactory to 
implement this. You'd copy specific elements like tei:title into properties 
using xdmp:document-set-properties() or a related function, while still 
inserting the entire document. Then when you want something from that set of 
known elements, you can retrieve it using xdmp:document-properties() or a 
related function, without touching the main document.
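
As a sketch, the per-document step might look like this, reusing the URI from
your single-document test:

    xquery version "1.0-ml";
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    let $uri := "/content/mldocs/A09134.xml"
    return xdmp:document-set-properties(
        $uri,
        doc($uri)/tei:TEI/tei:teiHeader/
            tei:fileDesc/tei:titleStmt/tei:title)

A later request can then read it back with
xdmp:document-properties($uri)//tei:title, without ever loading the main
document fragment.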

You can also create range indexes on property elements, combining those two 
strategies.
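
If I remember right, the lexicon accessors take a fragment-scope option, so
with a range index on the property element you can read values from the
properties fragments alone:

    cts:element-values(xs:QName('tei:title'), (), 'properties')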

-- Mike

> On 19 Jan 2015, at 06:02 , Craig A. Berry <[email protected]> wrote:
> 
> 
>> On Jan 18, 2015, at 9:14 PM, Michael Blakeley <[email protected]> wrote:
> 
> Thanks for the reply.
> 
>> Adding fragment rules makes sense if and only if you have large documents 
>> with a number of elements that form conceptually equivalent sub-documents. 
>> This works when the document acts something like a table, and for whatever 
>> reason you don't want to split it on ingestion. So you create virtual 
>> sub-documents: not as good as true documents, but good enough — and ideal 
>> for certain situations. From what I understand you aren't in any of those 
>> situations. Each of your documents is large, but there's no conceptually 
>> useful sub-document structure. 
>> 
>> All is not lost: MarkLogic should still be able to do the job. I've worked 
>> with a database over 7-TB in size with a significant number of large 
>> documents, some well above 50-MB.
>> 
>> In a situation like that you have to be careful with your queries. 
>> Unfiltered search and lexicon accessors don't much care how large your 
>> documents are: use them wherever possible. Avoid returning large result 
>> sets: if that means you have to cap the page size for search results, do it. 
>> You might be able to arrange things so that you can display search results 
>> and other query reports entirely from some mix of range indexes and 
>> properties, without touching the documents themselves.
>> 
>> Maybe you could write up one of "can't really do anything" use cases, and 
>> ask us how to solve it? You might get some useful ideas, and you could 
>> repeat that with other use-cases until you feel comfortable with the 
>> techniques.
> 
> Sure.  I started off just creating a new flow and trying to load the 
> documents into the Documents database with all the default settings. It 
> loaded some documents but blew up with XDMP-FRAGTOOLARGE errors.  I could 
> mitigate this somewhat by setting the transaction size down from 50 documents 
> to 10 documents. It loaded quite a few more documents before failing but 
> still failed.  I changed the "in-memory tree size" on the Documents database 
> from its default 32MB to 64MB but didn't notice that it made any difference.
> 
> I did have about 90% of my documents loaded and decided to give up on loading 
> the rest for the moment and see if I could query anything.  A basic 
> search:search query on a word that I knew occurred only a handful of times in 
> the corpus worked, but a query for a slightly less rare word failed with 
> "expanded tree cache full."
> 
> I next built my own query that I knew would return only one result.  This 
> selects the title out of the header of a single document:
> 
> -----
> xquery version "1.0-ml";
> declare namespace tei = "http://www.tei-c.org/ns/1.0";
> 
> for $doc in doc("/content/mldocs/A09134.xml")
> return $doc/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title
> -----
> 
> That worked, but changing that to return one result for each document in the 
> collection (about 500 results):
> 
> -----
> for $doc in collection("/tickets/ticket/16669535610111738813")
> return $doc/tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title
> -----
> 
> once again failed with an "expanded tree cache full" error.
> 
> Only after adding a fragment root could I even load all the documents in the 
> database or run a simple query that returns one result per document.  But I 
> still run into tree cache full errors with some regularity.  I can imagine 
> taking steps to limit the number of results, but the limit would need to be 
> something like 10,000, not 10 or 100.  I was hoping regularly-sized fragments 
> might be the key to predicting when simple queries are going to step off a 
> cliff into undefined behavior.  If it's not, I don't know what is.
> 
> ________________________________________
> Craig A. Berry
> mailto:[email protected]
> 
> "... getting out of a sonnet is much more
> difficult than getting in."
>                 Brad Leithauser
> 

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
