This got me thinking. ML has some EC2 APIs; it would be great if it had S3 APIs as well. These are not trivial to implement using the HTTP core methods, so they are an ideal candidate for a built-in function.
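For example, even a plain GET of one object via the HTTP core methods means computing the AWS request signature yourself. Here is a minimal sketch of a pre-signed GET, assuming AWS signature version 2, an HMAC-SHA1 builtin (xdmp:hmac-sha1, where your server version has it), and placeholder credentials, bucket, and key:

xquery version "1.0-ml";

(: Placeholders -- substitute real credentials, bucket, and key. :)
let $access-key := "AKIAEXAMPLE"
let $secret-key := "SECRETEXAMPLE"
let $bucket     := "my-bucket"
let $key        := "batch-0001.zip"

(: Expiry time as seconds since the epoch, ~15 minutes out. :)
let $expires :=
  xs:unsignedLong((current-dateTime() - xs:dateTime("1970-01-01T00:00:00Z"))
                  div xs:dayTimeDuration("PT1S")) + 900

(: AWS signature v2: HMAC-SHA1 over the canonical string-to-sign. :)
let $string-to-sign :=
  string-join(("GET", "", "", string($expires),
               concat("/", $bucket, "/", $key)), "&#10;")
let $signature := xdmp:hmac-sha1($secret-key, $string-to-sign, "base64")

let $url :=
  concat("https://", $bucket, ".s3.amazonaws.com/", $key,
         "?AWSAccessKeyId=", $access-key,
         "&amp;Expires=", string($expires),
         "&amp;Signature=", xdmp:url-encode($signature))
return xdmp:http-get($url)

Listing a bucket works the same way -- sign a GET on the bucket URL and parse the returned ListBucketResult XML -- which is exactly the sort of boilerplate a builtin could hide.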
----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

From: [email protected] [mailto:[email protected]] On Behalf Of Lee, David
Sent: Monday, September 19, 2011 2:45 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Loading large sets of small files

I did not attempt to put the files individually to S3. My experience with S3 is that the latency of storing files is fairly high. Also, for my use cases I would have had to either A) store the URLs for every single document (which would be nearly as large as the documents themselves) or B) implement the S3 protocol in ML ... something I don't want to do. So what I did was pre-compute a signed S3 URL for the zip file and store that in a small 'metadata' XML file.

I find it's an interesting, perhaps corner, use case where the metadata describing the documents can be as large as the documents themselves unless you chunk them up into something like a zip file. Originally I had stored this larger XML file (of 2000-10000 'records') in ML and used fragmentation rules, but I found, as is often recommended, that having each record have its own URI is useful. It eats up some space, though, since keeping a URI lexicon adds overhead: when a document is only 200 bytes and its URI is 60, some serious additional overhead occurs storing millions of these.

I'd love to see ML optimized for this kind of use ... for example, the ability to store these 'record-like' documents in a bigger document but fragment them and have a direct index into the fragment (uri#fragment_id?). This is definitely bordering the line between relational-type data and content-type data.

----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

From: [email protected] [mailto:[email protected]] On Behalf Of Justin Makeig
Sent: Monday, September 19, 2011 2:09 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Loading large sets of small files

Are the files stored individually in S3? Did you try accessing them one-by-one from S3 with MarkLogic's HTTP client? I'd be curious about the performance characteristics of that relative to the two techniques you shared. For example, list the contents of a bucket, chunk the list into individual segments, and loop over each segment in parallel. This is the technique that Information Studio uses to walk a filesystem, for example. I'd be curious whether this technique could apply efficiently to something like S3 as well (see the sketch at the end of this thread). Thanks for the info.

Justin

Justin Makeig
Senior Product Manager
MarkLogic Corporation
[email protected]
www.marklogic.com

On Sep 19, 2011, at 10:47 AM, Lee, David wrote:

I wanted to share my experience trying several techniques for loading large sets of small files into MarkLogic. My use case is loading many (100k+) very small XML files into ML. Each file is typically 50-200 bytes. To make the job easier I've been batching them into chunks of up to 2000 files so I can incrementally load a batch of files on demand. (I'd like to load ALL of them, but for various reasons beyond this discussion I'm only loading a single 'batch' at a time.) I have the files stored in Amazon S3. To make life easier (and ideally more efficient) I experimented with several techniques.
Ultimately I ended up with two techniques that are very similar but have amazingly different performance. Due to the architecture of the app, I want to be able to 'pull' these files from an ML app directly. When a request comes in to load the files, the ML app fetches them from Amazon (via a URL), unpacks them, and does a document-insert.

1) Zip of many XML files. Zip the XML files (up to 2000) into a single zip file. In ML, unzip the file and extract the documents one by one (by iterating over the manifest) and load them.

2) Zip of a single wrapped XML document. Wrap the XML files into a single big XML document with a root element. Zip that XML file. In ML, unzip the file, then iterate over the children of the root and insert each child as a separate document.

All this is running on Amazon, so the network speed between ML and S3 is quite fast.

I first started with #1 and it worked ... but it would take quite a while. Fetching a 2000-'record' zip and extracting and loading it would take up to a minute. Some performance analysis gave me a clue to try #2.

For starters, I was amazed to find that the zip of 2000 small documents didn't compress much. On reflection it makes sense: zip compresses each entry individually, so each tiny file pays per-entry header overhead and never builds up a useful compression dictionary. I'd rather use a tar/gz format, but ML doesn't have native methods for that (only zip). So that's why I tried #2.

The file size dropped by 10x, and the load time dropped by 10x! Performance traces showed that most of the time was the overhead of extracting individual files from the zip. Extracting 2000 small files took 10x longer than extracting one file (with the same total size of uncompressed data). And amazingly, the overhead of having to parse that one big XML file and then do an XPath on it to pull out all the children was minimal compared to the unzip overhead.

Anyway, just thought I'd share this in case anyone is hitting a similar issue: unzipping a zip of lots of small files is horribly expensive compared to unzipping a single big file. (Rough XQuery sketches of both techniques, and of the segment-and-spawn idea above, follow below.)

----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224
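For concreteness, technique #1 (zip of many files) might look roughly like this in XQuery. This is a sketch only: the signed URL and the /batch-0001/ URI prefix are placeholders, and error handling is omitted.

xquery version "1.0-ml";
declare namespace zip = "xdmp:zip";

(: Placeholder: a pre-signed S3 URL for the batch zip. :)
let $signed-url := "https://my-bucket.s3.amazonaws.com/batch-0001.zip?..."

(: xdmp:http-get returns (response headers, body); [2] is the body. :)
let $zipfile := xdmp:http-get($signed-url,
                  <options xmlns="xdmp:http">
                    <format xmlns="xdmp:document-get">binary</format>
                  </options>)[2]

(: Iterate the zip manifest and insert each entry as its own document. :)
for $part in xdmp:zip-manifest($zipfile)/zip:part
let $name := string($part)
return xdmp:document-insert(concat("/batch-0001/", $name),
                            xdmp:zip-get($zipfile, $name))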
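Technique #2 (single wrapped XML document) differs only in the extraction step. The entry name records.xml and the record URI scheme are hypothetical:

xquery version "1.0-ml";

(: Placeholder signed URL; the fetch is the same as in technique #1. :)
let $signed-url := "https://my-bucket.s3.amazonaws.com/batch-0001.zip?..."
let $zipfile := xdmp:http-get($signed-url,
                  <options xmlns="xdmp:http">
                    <format xmlns="xdmp:document-get">binary</format>
                  </options>)[2]

(: One extraction for the whole batch, then XPath over the children. :)
let $wrapped := xdmp:zip-get($zipfile, "records.xml")
for $record at $i in $wrapped/*/*
return xdmp:document-insert(concat("/batch-0001/record-", $i, ".xml"),
                            document { $record })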
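And Justin's segment-and-spawn suggestion might look something like the following -- a sketch assuming a hypothetical module /load-segment.xqy that declares an external variable uris-csv, fetches each URL in it, and inserts the body. Each xdmp:spawn queues a segment on the task server, so segments load in parallel:

xquery version "1.0-ml";

(: In practice $uris would come from listing the bucket (parsing the
   ListBucketResult XML); hard-coded here for brevity. :)
let $uris := ("https://my-bucket.s3.amazonaws.com/doc-00001.xml",
              "https://my-bucket.s3.amazonaws.com/doc-00002.xml")
let $segment-size := 500
for $i in 1 to xs:integer(ceiling(count($uris) div $segment-size))
let $segment := $uris[position() gt ($i - 1) * $segment-size and
                      position() le $i * $segment-size]
return
  (: Queue one task per segment on the task server. :)
  xdmp:spawn("/load-segment.xqy",
             (xs:QName("uris-csv"), string-join($segment, ",")))

Whether per-object GET latency to S3 makes this competitive with the batch-zip approach is exactly the open question above.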
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
