I wanted to share my experience trying several techniques for loading large sets 
of small files into MarkLogic.
My use case is loading many (100k+) very small XML files into ML.
Each file is typically 50-200 bytes.  To make the job easier I've been 
batching them into chunks of up to 2000 files so I can incrementally load a 
batch of files on demand.
(I'd like to load ALL of them, but for various reasons beyond this discussion 
I'm only loading a single 'batch' at a time.)

I have the files stored in Amazon S3.   To make life easier (and ideally more 
efficient) I experimented with several techniques, and ultimately ended up with 
two that are very similar but have amazingly different performance.
Due to the architecture of the app I want the ML app to be able to 'pull' these 
files directly.

When a request comes in to load the files, the ML app fetches them from Amazon 
(via a URL), unpacks them, and inserts each one with a document-insert.
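For the fetch step itself I'm just pulling the zip over HTTP from within ML, 
roughly like this (a simplified sketch, not my exact code; the bucket and URL 
are made up):

  xquery version "1.0-ml";

  (: Hypothetical S3 URL for one batch of files :)
  let $url := "http://my-bucket.s3.amazonaws.com/batches/batch-0001.zip"

  (: Fetch the zip from S3 as a binary node :)
  return xdmp:document-get($url,
           <options xmlns="xdmp:document-get"><format>binary</format></options>)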

1)  Zip of many XML files
Zip the XML files (up to 2000) into a single zip file.
In ML, unzip the file, iterate over the manifest to extract the entries one by 
one, and insert each one.
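A simplified sketch of what #1 looks like (the zip entries are assumed to be 
named *.xml, and the "/incoming/" URI prefix is just a placeholder):

  xquery version "1.0-ml";
  declare namespace zip = "xdmp:zip";

  (: Fetch the batch zip from S3 (same hypothetical URL as above) :)
  let $zip := xdmp:document-get(
                "http://my-bucket.s3.amazonaws.com/batches/batch-0001.zip",
                <options xmlns="xdmp:document-get"><format>binary</format></options>)

  (: Walk the zip manifest and extract/insert each entry individually :)
  for $part in xdmp:zip-manifest($zip)/zip:part
  let $name := fn:string($part)
  return xdmp:document-insert(fn:concat("/incoming/", $name),
                              xdmp:zip-get($zip, $name))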

2) Zip of a single wrapped XML document
Wrap the XML files into a single big XML document with a root element.
Zip that XML file.
In ML, unzip the file, then iterate over the children of the root and insert 
each child as a separate document.
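And a similarly simplified sketch of #2 (the entry name "batch.xml" and the URI 
scheme are placeholders; the wrapper root element can be anything since the 
code just takes its children):

  xquery version "1.0-ml";

  (: Fetch the batch zip from S3 :)
  let $zip := xdmp:document-get(
                "http://my-bucket.s3.amazonaws.com/batches/batch-0001.zip",
                <options xmlns="xdmp:document-get"><format>binary</format></options>)

  (: One extraction: the zip holds a single wrapped XML document :)
  let $wrapped := xdmp:zip-get($zip, "batch.xml")

  (: Insert each child of the wrapper root as its own document :)
  for $child at $i in $wrapped/*/*
  return xdmp:document-insert(fn:concat("/incoming/doc-", $i, ".xml"), $child)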

All this is running on Amazon so the network speed between ML and S3 is quite 
fast.


I first started with #1 and it worked ... but would take quite a while.
Fetching a 2000-'record' zip and extracting and loading it would take up to a 
minute.
Some performance analysis gave me a clue to try #2.
For starters I was amazed to find that the zip of 2000 small documents didn't 
compress much.
On reflection it makes sense: zip compresses each document individually, and the 
documents are so small that it doesn't work well.   I'd rather use a tar/gz 
format, but ML doesn't have native methods for that (only zip).
So that's why I tried #2.

The file size dropped by 10x and the load time dropped by 10x !
Performance traces showed that most of the time in #1 went to the per-entry 
overhead of extracting individual files from the zip.   Extracting 2000 small 
files took 10x longer than extracting one file of the same total uncompressed 
size.    And amazingly, the overhead of parsing that one big XML file and then 
running an XPath over it to pull out all the children was minimal compared to 
the unzip overhead.

Anyway just thought I'd share this in case anyone is hitting a similar issue.
Unzipping a zip of lots of small files is horribly expensive compared to 
unzipping a single big file.








----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]<mailto:[email protected]>
812-482-5224
