Thanks Mike, this makes some sense.

Well ... my experiments to efficiently upload millions of tiny docs have been a total failure. Not only didn't batch uploading increase the speed beyond about 20 docs/sec, but after I killed it off 5 hours later (only about 1/4 done), I tried to DELETE the 900k files it had created ... 5 hours later I'm still trying to delete them. Deleting them seems to take longer than creating them! I've had just as bad an experience doing a directory-delete as a for loop with document-delete ... Slooooooooowwwwww. I guess I understand why: not only is it deleting the files, but it's got to update all the indexes and terms that were created when they were uploaded ... Still makes me wish for a quick delete.

Maybe this is what the real value of Forests is?? I could probably delete a forest in a second. If I have, say, 10 "kinds" of data, does it make sense to put them in separate Forests? Will the indexes be logically joined when I put them all in the same DB? The ML docs talk about Forests for managing spanning disks, swapping data and such, but maybe this is a good use case.
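Something like this is what I'm imagining, as an untested sketch (it assumes a forest named "kind-a" has already been created and attached to the DB):

```xquery
(: Route one "kind" of data into its own forest at insert time.
   The last argument of xdmp:document-insert is the target forest. :)
xdmp:document-insert(
  "/kind-a/doc-1.xml",
  <row><RXAUI>2483417</RXAUI></row>,
  xdmp:default-permissions(),
  (),   (: collections :)
  0,    (: quality :)
  xdmp:forest("kind-a"))

(: Then deleting that whole "kind" later would be one bulk operation
   instead of millions of document deletes:

   xdmp:forest-clear(xdmp:forest("kind-a"))
:)
```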
So I'm experimenting with a hybrid solution. I re-grouped my XML files so that I have a directory of XML files, each containing 1000 "records". I was able to upload these to the ML server quite quickly. I'm now applying fragmentation rules to these records (which is going slowly, because the delete of the remaining 500k tiny documents is hogging the system) ... but maybe this hybrid solution might work as a compromise. We shall see ... on a new day :) (tomorrow).

-David

-----Original Message-----
From: Michael Blakeley [mailto:[email protected]]
Sent: Tuesday, December 08, 2009 11:57 AM
To: Lee, David
Cc: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Inserting millions of small documents

Removing the unnecessary sub-fragments removes the need to constantly refer to the parent fragment during evaluation. Remember, XPath requires that we preserve document order - so if your XML specifies a parent-child structure, we have to honor it. Removing this superfluous parent-child layer frees up resources for more interesting work.

That "magic spot" is probably related to the point where the parent fragment and children no longer fit into your CPU's L2 or L3 cache. Main memory is slower than on-die cache, and disk is slower still.

RecordLoader is designed to be somewhat extensible, BTW. You could probably include a buffering strategy as part of a Content or ContentFactory interface implementation.

-- Mike

On 2009-12-08 06:26, Lee, David wrote:
> Thanks for all the questions, I appreciate every one!
> In fact I want to say thanks to everyone on this mailing list; it's one of the
> most helpful I've been on for any difficult-to-learn product.
> All your advice (and patience!) is greatly appreciated.
>
> I'll answer a couple of Mike's questions, but the rest will wait until after
> I run some experiments.
> I'm modifying my own version of RecordLoader to do what I want (the MarkLogic
> extension to xmlsh, http://www.xmlsh.org/ModuleMarkLogic).
> I'm modifying the "put" command to be able to batch up groups of files to
> send as a single transaction, and that seems to be going much faster than
> RecordLoader.
>
> As for the 300mil fragments, sorry, that was a typo. It was 3mil.
> I just turned off the maintain-last-modified flag; thanks for the
> suggestion.
>
> Thanks to a volunteer from ML Tech Support yesterday who sat with me on IM
> and live on my system for an hour to test things out,
> I've come to the following conclusions. I apologize for being "stubborn", but
> I don't like to make architecture decisions on vague data;
> I like hard numbers and some rationale I can sink my teeth into besides "that's
> what everyone else does".
> But you all were right and I was wrong ... sorta.
>
> There seems to be a magic spot somewhere between 500,000 and 3.5 million
> fragments where, at least on my server, a single fragmented doc
> searches very poorly. I have a 500k-fragment doc that searches extremely
> fast, but my 3.5mil-fragment doc is 100x slower,
> even though it's only about 2x bigger in total size. I don't have a great
> answer (and neither did the tech rep), but some hints were given
> that massive fragmentation is not optimized as well in ML as separate docs.
> This is completely against my, obviously wrong, presumptions about
> how a database designed for large XML documents would (or should) behave, but
> there it is.
> I was also told by the tech rep that the engineers optimize for the model of
> "a document is a row".
> But OTOH ... until I hit that magic number, fragments perform very well; hit
> that brick wall and then I'm seeing 5+ sec search times.
>
> The other result I was able to verify is that XPath optimization can be
> "fooled" by using declare instead of let!
> I don't quite get this, but it was very provable.
>
> Example:
>
> This query takes about 20 minutes to run, and analyzing it shows it
> iteratively searches through all 3mil fragments:
>
> -----------------
> declare variable $id := '2483417';
> declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
> declare variable $id2 := $c/RXAUI/string();
>
> for $r in doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
> return $r
> -------------
>
> Whereas this query performs as fast as using cts:search (in my case 5 sec):
> ----------------
>
> declare variable $id := '2483417';
> declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
> let $id2 := $c/RXAUI/string()
> return
> for $r in doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
> return $r
>
> -------------------
>
> For some reason, using declare fools the optimizer into not using indexes,
> but let allows the indexes to be used.
> Amazing but true!!!!!
> (But if I change the declare of $id2 to something reasonably constant, it
> uses indexes, like this:
>
> ==============
>
> declare variable $id := '2483417';
> declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
> declare variable $id2 := concat( '2483' , '417' );
>
> for $r in doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
> return $r
>
> =========
>
> Fast again.)
>
> -----Original Message-----
> From: Michael Blakeley [mailto:[email protected]]
> Sent: Tuesday, December 08, 2009 12:40 AM
> To: General Mark Logic Developer Discussion
> Cc: Lee, David
> Subject: Re: [MarkLogic Dev General] Inserting millions of small documents
>
> David,
>
> Based on what you've said so far, I would not recommend assembling 3M
> filesystem documents.
> But if you are using a single input file, I would
> split that up (see below), and I might consider 3M zip entries (in
> multiple archives, since Java has problems with more than 32768 entries
> per zip). Oddly, it's often more efficient to load a lot of small
> documents from zips than to load them directly from the filesystem. CPUs
> are fast.
>
> Given adequate CPU, memory, I/O, and the right configuration, I think
> you should be able to get at least 500 inserts/sec with 300-B documents.
> That would imply a 2-hr load time, but you might do better than that. If
> you are only getting 83 docs/sec on good hardware, then I think there is
> room for improvement.
>
> Advance apologies for what will seem like hostile cross-examination....
>
> Are you sure about "the same document (fragmented to 300 mil
> fragments)"? Is it a typo, or did I miss the reason for a 100x
> difference in fragment counts?
>
> Do you have maintain-last-modified disabled? Is directory-creation set
> to manual? Have you considered turning off any of the default full-text
> indexes? How many forests does the database have?
>
> When using RecordLoader, is the input a single XML file? If so, you
> won't get much, if any, benefit from multiple CPUs on the client or the
> server. I'd try to have 2 input files per server core, and tell
> RecordLoader to run two threads per server core.
>
> When using RecordLoader, which subsystem appears to be the bottleneck?
> Is it on the client or on the server?
>
> Note that RecordLoader has a "file loader" code path, which doesn't
> attempt to parse the input files at all, and a "parser loader" code
> path, which is designed to split up large files. Naturally they have
> different performance characteristics. From a configuration perspective,
> this is the difference between ID_NAME=#FILENAME and ID_NAME=foo - and
> the former is the default. Generally speaking, it also performs better
> than the "parser loader", but not always.
>
> -- Mike
>
> On 2009-12-07 17:40, Lee, David wrote:
>> I want to insert about 3 million 300-byte-ish docs into ML.
>> I tried using RecordLoader, and it did the trick but took about 10 hours.
>> Inserting the same document (fragmented to 300 mil fragments) as 1 document
>> using XCC directly takes about 1 hour.
>> Obviously things can be improved.
>>
>> Any suggestions on what might be fastest?
>> Suppose I have the 3 mil documents already split up in a directory on my
>> local filesystem.
>>
>> After talking to ML Tech Support, it was suggested that doing loads in
>> batches would be faster than one at a time. Maybe I can do better than
>> RecordLoader.
>>
>> Any suggestions using XCC on which would be faster?
>>
>> 1) ContentLoader.load( String[] , File[] )
>>
>> 2) Session.insertContent( Content[] )
>>
>> Another idea I had was to split the 1G doc into, say, 1000 (instead of 3 mil)
>> docs, each containing 3000 elements, then loading them
>> into ML (unfragmented), then running an XQuery program on the server to
>> create the final 3 mil documents.
>>
>> ----------------------------------------
>> David A. Lee
>> Senior Principal Software Engineer
>> Epocrates, Inc.
>> [email protected]<mailto:[email protected]>
>> 812-482-5224
>>
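P.S. That last quoted idea - loading big batch docs and splitting them on the server - might look roughly like this. Just a sketch: the directory URI and element names are made up for illustration, and for 3 mil rows you'd want to chunk the work (e.g. with xdmp:spawn) rather than run one giant transaction:

```xquery
(: Split each loaded batch document into standalone per-row docs. :)
for $batch in xdmp:directory("/RxNorm/batches/", "1")
for $row at $i in $batch/rxnsat/row
return
  xdmp:document-insert(
    concat(xdmp:node-uri($batch), "-", $i, ".xml"),
    $row)
```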
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
