Thanks for all the questions, I appreciate every one!
In fact, I want to say thanks to everyone on this mailing list; it's one of the
most helpful I've been on for any difficult-to-learn product.
All your advice (and patience!) is greatly appreciated.

I'll answer a couple of Mike's questions, but the rest will wait until after I
have run some experiments.
I'm modifying my own version of RecordLoader to do what I want (a MarkLogic
extension to xmlsh, http://www.xmlsh.org/ModuleMarkLogic).
I'm modifying the "put" command to be able to batch up groups of files to send
as a single transaction, and that seems to be going much faster than
RecordLoader.
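The batching idea is simple to sketch. Below is a minimal illustration in Python; `insert_batch` is a hypothetical placeholder for whatever actually performs the insert (in the real extension, a Java XCC call that commits one transaction per batch). Only the chunking logic is concrete:

```python
from pathlib import Path


def chunks(items, size):
    """Yield successive batches of at most `size` items from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def insert_batch(files):
    """Placeholder: in the real extension this would open a transaction,
    insert every document in `files`, and commit once."""
    pass


def load_directory(directory, batch_size=500):
    """Group files into batches and send each batch as one transaction,
    instead of one round trip per document."""
    files = sorted(Path(directory).glob("*.xml"))
    for batch in chunks(files, batch_size):
        insert_batch(batch)
```

The win comes from amortizing per-transaction overhead: one commit per few hundred small documents instead of one commit each.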

As for the 300 million fragments, sorry, that was a typo. It was 3 million.
I just turned off the maintain-last-modified flag; thanks for the suggestion.

Thanks to a volunteer from ML Tech Support who sat with me yesterday on IM,
live on my system for an hour to test things out,
I've come to the following conclusions. I apologize for being "stubborn", but I
don't like to make architecture decisions on vague data;
I like hard numbers and some rationale I can sink my teeth into besides "that's
what everyone else does".
But you all were right and I was wrong ... sorta.

There seems to be a magic spot somewhere between 500,000 and 3.5 million
fragments where, at least on my server, a single fragmented doc
searches very poorly. I have a 500k-fragment doc that searches extremely
fast, but my 3.5-million-fragment doc is 100x slower,
even though it's only about 2x bigger in total size. I don't have a great answer
(and neither did the tech rep), but some hints were given
that massive fragmentation is not optimized as well in ML as separate docs.
This is completely against my, obviously wrong, presumptions about
how a database designed for large XML documents would (or should) behave, but
there it is.
I was also told by the tech rep that the engineers optimize for the model of "a
document is a row".
But OTOH ... until I hit that magic number, fragments perform very well; hit that
brick wall, though, and I'm seeing 5+ second search times.


The other result I was able to verify is that XPath optimization can be
"fooled" by using declare instead of let!
I don't quite get this, but it was very reproducible.
 

Example:

This query takes about 20 minutes to run, and analyzing it shows it iteratively
searches through all 3 million fragments:

-----------------
declare variable $id := '2483417';
declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
declare variable $id2  := $c/RXAUI/string();

for $r in   doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
return $r
-------------

Whereas this query performs as fast as using cts:search (in my case 5 sec):
----------------

declare variable $id := '2483417';
declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
let $id2  := $c/RXAUI/string()
return
for $r in   doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
return $r

-------------------


For some reason, using declare fools the optimizer into not using indexes, but
let allows the indexes to be used.
Amazing but true!
(But if I change the $id2 declaration to something reasonably constant, it does
use indexes, like this:

==============

declare variable $id := '2483417';
declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
declare variable $id2  := concat( '2483' , '417' );

for $r in   doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
return $r

=========

Fast again.)
-----Original Message-----
From: Michael Blakeley [mailto:[email protected]] 
Sent: Tuesday, December 08, 2009 12:40 AM
To: General Mark Logic Developer Discussion
Cc: Lee, David
Subject: Re: [MarkLogic Dev General] Inserting millions of small documents

David,

Based on what you've said so far, I would not recommend assembling 3M 
filesystem documents. But if you are using a single input file, I would 
split that up (see below) and I might consider 3M zip entries (in 
multiple archives, since Java has problems with more than 32768 entries 
per zip). Oddly, it's often more efficient to load a lot of small 
documents from zips than to load them directly from the filesystem. CPUs 
are fast.

Given adequate CPU, memory, I/O, and the right configuration, I think 
you should be able to get at least 500 inserts/sec with 300-B documents. 
That would imply a 2-hr load time, but you might do better than that. If 
you are only getting 83 docs/sec on good hardware, then I think there is 
room for improvement.

Advance apologies for what will seem like hostile cross-examination....

Are you sure about "the same document (fragmented to 300 mil 
fragments)"? Is it a typo, or did I miss the reason for a 100x 
difference in fragment counts?

Do you have maintain-last-modified disabled? Is directory-creation set 
to manual? Have you considered turning off any of the default full-text 
indexes? How many forests does the database have?

When using RecordLoader, is the input a single XML file? If so, you 
won't get much, if any, benefit from multiple CPUs on the client or the 
server. I'd try to have 2 input files per server core, and tell 
RecordLoader to run two threads per server core.

When using RecordLoader, which subsystem appears to be the bottleneck? 
Is it on the client or on the server?

Note that RecordLoader has a "file loader" code path, which doesn't 
attempt to parse the input files at all, and a "parser loader" code 
path, which is designed to split up large files. Naturally they have 
different performance characteristics. From a configuration perspective, 
this is the difference between ID_NAME=#FILENAME and ID_NAME=foo - and 
the former is the default. Generally speaking, it also performs better 
than the "parser loader", but not always.

-- Mike

On 2009-12-07 17:40, Lee, David wrote:
> I want to insert about 3 million 300 byteish docs to ML.
> I tried using RecordLoader and it did the trick but took about 10 hours.
> Inserting the same document (fragmented to 300 mil fragments) as 1 document 
> using XCC directly takes about 1 hour.
> Obviously things can be improved.
>
> Any suggestions on what might be fastest ?
> Suppose I have the 3 mil documents already split up in a directory on my 
> local filesystem.
>
> After talking to ML Tech support it was suggested that doing loads in 
> batches would be faster than one at a time.  Maybe I can do better than 
> RecordLoader.
>
> Any suggestions using XCC on which would be faster  ?
>
>
> 1)      ContentLoader.load(  String[] , File[] )
>
> 2)      Session.insertContent( Content[] )
>
>
> Another idea I had was to split the 1G doc into say 1000 (instead of 3 mil) 
> docs, each containing 3000 elements, then loading them
> into ML (unfragmented), then running an xquery program on the server to 
> create the final 3 mil documents.
>
>
>
> ----------------------------------------
> David A. Lee
> Senior Principal Software Engineer
> Epocrates, Inc.
> [email protected]<mailto:[email protected]>
> 812-482-5224
>
>
>

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
