Removing the unnecessary sub-fragments removes the need to constantly
refer to the parent fragment during evaluation. Remember, XPath requires
that we preserve document order - so if your XML specifies a
parent-child structure, we have to honor it. Removing this superfluous
parent-child layer frees up resources for more interesting work.
That "magic spot" is probably related to the point where the parent
fragment and children no longer fit into your CPU's L2 or L3 cache size.
Main memory is slower than on-die cache, and disk is even slower.
RecordLoader is designed to be somewhat extensible, BTW. You could
probably include a buffering strategy as part of a Content or
ContentFactory interface implementation.
-- Mike
On 2009-12-08 06:26, Lee, David wrote:
Thanks for all the questions, I appreciate every one !
In fact I want to say thanks to everyone on this mailing list; it's one of the
most helpful I've been on for any difficult-to-learn product.
All your advice (and patience!) is greatly appreciated.
I'll answer a couple of Mike's questions, but the rest will wait until after I
run some experiments.
I'm modifying my own version of RecordLoader to do what I want (MarkLogic
extension to xmlsh, http://www.xmlsh.org/ModuleMarkLogic)
I'm modifying the "put" command to be able to batch up groups of files to send
as a single transaction, and that seems to be going much faster than RecordLoader.
As for the 300mil fragments, sorry, that was a typo. It was 3mil.
I just turned off the maintain-last-modified flag; thanks for the suggestion.
Thanks to a volunteer from ML Tech Support yesterday who sat with me on IM and
live on my system for an hour to test things out,
I've come to the following conclusions. I apologize for being "stubborn", but I
don't like to make architecture decisions on vague data;
I like hard numbers and some rationale I can sink my teeth into besides "that's what
everyone else does".
But you all were right and I was wrong ... Sorta.
There seems to be a magic spot somewhere between 500,000 and 3.5 million
fragments where, at least on my server, a single fragmented doc
searches very poorly. I have a 500k-fragment doc that searches extremely
fast, but my 3.5-million-fragment doc is 100x slower,
even though it's only about 2x bigger in total size. I don't have a great answer
(and neither did the tech rep), but some hints were given
that massive fragmentation is not optimized as well in ML as separate docs are.
This runs completely against my (obviously wrong) presumptions about
how a database designed for large XML documents would (or should) behave, but
there it is.
I was also told by the tech rep that the engineers optimize for the model of "a
document is a row".
But on the other hand, until I hit that magic number, fragments perform very well;
past that brick wall I'm seeing 5+ second search times.
The other result I was able to verify is that XPath optimization can be
"fooled" by using declare instead of let!
I don't quite understand why, but it was easily reproducible.
Example:
This query takes about 20 minutes to run, and analyzing it shows it iteratively
searches through all 3 million fragments:
-----------------
declare variable $id := '2483417';
declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
declare variable $id2 := $c/RXAUI/string();
for $r in doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
return $r
-------------
Whereas this query performs as fast as using cts:search (in my case 5 sec):
----------------
declare variable $id := '2483417';
declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
let $id2 := $c/RXAUI/string()
return
for $r in doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
return $r
-------------------
For some reason, using declare fools the optimizer into not using indexes, but
let allows the indexes to be used.
Amazing but true!
(But if I change the declare of $id2 to something reasonably constant, it uses
indexes, like this:
==============
declare variable $id := '2483417';
declare variable $c := doc("/RxNorm/rxnconso.xml")/rxnconso/row[RXAUI eq $id];
declare variable $id2 := concat( '2483' , '417' );
for $r in doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
return $r
=========
Fast again.)
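One way to see what the optimizer is actually doing (just a sketch, not something
I've folded into the numbers above) is xdmp:plan, which returns an XML description
of how MarkLogic will evaluate an expression, so you can compare the declare and
let variants directly:
-----------------
xquery version "1.0-ml";
declare variable $id2 := '2483417';
(: returns an XML description of the evaluation plan for this
   expression; compare the output for the declare vs. let variants :)
xdmp:plan(
  doc("/RxNorm/rxnsat.xml")/rxnsat/row[RXAUI eq $id2]
)
-----------------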
-----Original Message-----
From: Michael Blakeley [mailto:[email protected]]
Sent: Tuesday, December 08, 2009 12:40 AM
To: General Mark Logic Developer Discussion
Cc: Lee, David
Subject: Re: [MarkLogic Dev General] Inserting millions of small documents
David,
Based on what you've said so far, I would not recommend assembling 3M
filesystem documents. But if you are using a single input file, I would
split that up (see below) and I might consider 3M zip entries (in
multiple archives, since Java has problems with more than 32768 entries
per zip). Oddly, it's often more efficient to load a lot of small
documents from zips than to load them directly from the filesystem. CPUs
are fast.
Given adequate CPU, memory, I/O, and the right configuration, I think
you should be able to get at least 500 inserts/sec with 300-B documents.
That would imply a 2-hr load time, but you might do better than that. If
you are only getting 83 docs/sec on good hardware, then I think there is
room for improvement.
Advance apologies for what will seem like hostile cross-examination....
Are you sure about "the same document (fragmented to 300 mil
fragments)"? Is it a typo, or did I miss the reason for a 100x
difference in fragment counts?
Do you have maintain-last-modified disabled? Is directory-creation set
to manual? Have you considered turning off any of the default full-text
indexes? How many forests does the database have?
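If you do need to flip those two settings, an Admin API call along these lines
should do it. This is only a sketch: "your-db" is a placeholder, and I'd
double-check the function names against the Admin API documentation before
running it.
-----------------
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";
let $config := admin:get-configuration()
let $db     := xdmp:database("your-db")  (: placeholder database name :)
(: stop tracking last-modified on every update, and stop
   auto-creating directory fragments on insert :)
let $config := admin:database-set-maintain-last-modified($config, $db, fn:false())
let $config := admin:database-set-directory-creation($config, $db, "manual")
return admin:save-configuration($config)
-----------------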
When using RecordLoader, is the input a single XML file? If so, you
won't get much, if any, benefit from multiple CPUs on the client or the
server. I'd try to have 2 input files per server core, and tell
RecordLoader to run two threads per server core.
When using RecordLoader, which subsystem appears to be the bottleneck?
Is it on the client or on the server?
Note that RecordLoader has a "file loader" code path, which doesn't
attempt to parse the input files at all, and a "parser loader" code
path, which is designed to split up large files. Naturally they have
different performance characteristics. From a configuration perspective,
this is the difference between ID_NAME=#FILENAME and ID_NAME=foo - and
the former is the default. Generally speaking, it also performs better
than the "parser loader", but not always.
-- Mike
On 2009-12-07 17:40, Lee, David wrote:
I want to insert about 3 million 300-byte-ish docs into ML.
I tried using RecordLoader and it did the trick but took about 10 hours.
Inserting the same document (fragmented to 300 mil fragments) as 1 document
using XCC directly takes about 1 hour.
Obviously things can be improved.
Any suggestions on what might be fastest?
Suppose I have the 3 mil documents already split up in a directory on my local
filesystem.
After talking to ML Tech Support, it was suggested that doing loads in
batches would be faster than one at a time. Maybe I can do better than
RecordLoader.
Any suggestions using XCC on which would be faster ?
1) ContentLoader.load( String[] , File[] )
2) Session.insertContent( Content[] )
Another idea I had was to split the 1G doc into, say, 1000 (instead of 3 mil)
docs each containing 3000 elements, load them
into ML (unfragmented), then run an XQuery program on the server to create
the final 3 million documents.
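If I go that route, the server-side split could look something like this. Just a
sketch: the batch doc URI and the output URI pattern are made up, and I've only
used the row/RXAUI structure from my examples above.
-----------------
xquery version "1.0-ml";
(: split one loaded batch doc into one document per row;
   "/RxNorm/batch-0001.xml" and the output URIs are hypothetical :)
for $row in doc("/RxNorm/batch-0001.xml")/rxnconso/row
let $uri := concat("/RxNorm/rxnconso/", $row/RXAUI/string(), ".xml")
return xdmp:document-insert($uri, $row)
-----------------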
----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]<mailto:[email protected]>
812-482-5224
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general