This got me thinking. ML has some EC2 APIs; it would be great if it had S3 APIs as well. These are not trivial to implement using the HTTP core methods, so they are an ideal candidate for a built-in function.
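For example, even a plain GET of one object via the HTTP core methods means computing the AWS request signature yourself. Here is a minimal sketch of a pre-signed GET, assuming AWS signature version 2, an HMAC-SHA1 builtin (xdmp:hmac-sha1, where your server version has it), and placeholder credentials, bucket, and key:

xquery version "1.0-ml";

(: Placeholders -- substitute real credentials, bucket, and key. :)
let $access-key := "AKIAEXAMPLE"
let $secret-key := "SECRETEXAMPLE"
let $bucket     := "my-bucket"
let $key        := "batch-0001.zip"

(: Expiry time as seconds since the epoch, ~15 minutes out. :)
let $expires :=
  xs:unsignedLong((current-dateTime() - xs:dateTime("1970-01-01T00:00:00Z"))
                  div xs:dayTimeDuration("PT1S")) + 900

(: AWS signature v2: HMAC-SHA1 over the canonical string-to-sign. :)
let $string-to-sign :=
  string-join(("GET", "", "", string($expires),
               concat("/", $bucket, "/", $key)), "&#10;")
let $signature := xdmp:hmac-sha1($secret-key, $string-to-sign, "base64")

let $url :=
  concat("https://", $bucket, ".s3.amazonaws.com/", $key,
         "?AWSAccessKeyId=", $access-key,
         "&amp;Expires=", string($expires),
         "&amp;Signature=", xdmp:url-encode($signature))
return xdmp:http-get($url)

Listing a bucket works the same way -- sign a GET on the bucket URL and parse the returned ListBucketResult XML -- which is exactly the sort of boilerplate a builtin could hide.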
----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

From: [email protected] [mailto:[email protected]] On Behalf Of Lee, David
Sent: Monday, September 19, 2011 2:45 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Loading large sets of small files

I did not attempt to put the files individually to S3. My experience with S3 is that the latency of storing files is fairly high. Also, for my use cases I would have had to either A) store the URLs for every single document (which would be nearly as large as the documents themselves) or B) implement the S3 protocol in ML ... something I don't want to do. So what I did was pre-compute a signed S3 URL for the zip file and store that in a small 'metadata' XML file.

I find it's an interesting, perhaps corner, use case where the metadata describing the documents can be as large as the documents themselves unless you chunk them up into something like a zip file. Originally I had stored this larger XML file (of 2000-10000 'records') in ML and used fragmentation rules, but I found, as is often recommended, that having each record have its own URI is useful. It eats up some space, though, since keeping a URI lexicon adds overhead: when a document is only 200 bytes and its URI is 60, some serious additional overhead occurs storing millions of these.

I'd love to see ML optimized for this kind of use ... for example, the ability to store these 'record-like' documents in a bigger document but fragment them and have a direct index into the fragment (uri#fragment_id?). This is definitely bordering the line between relational-type data and content-type data.

----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

From: [email protected] [mailto:[email protected]] On Behalf Of Justin Makeig
Sent: Monday, September 19, 2011 2:09 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Loading large sets of small files

Are the files stored individually in S3? Did you try accessing them one-by-one from S3 with MarkLogic's HTTP client? I'd be curious about the performance characteristics of that relative to the two techniques you shared. For example, list the contents of a bucket, chunk the list into individual segments, and loop over each segment in parallel. This is the technique that Information Studio uses to walk a filesystem, for example. I'd be curious whether this technique could apply efficiently to something like S3 as well (see the sketch at the end of this thread). Thanks for the info.

Justin

Justin Makeig
Senior Product Manager
MarkLogic Corporation
[email protected]
www.marklogic.com

On Sep 19, 2011, at 10:47 AM, Lee, David wrote:

I wanted to share my experience trying several techniques for loading large sets of small files into MarkLogic. My use case is loading many (100k+) very small XML files into ML. Each file is typically 50-200 bytes. To make the job easier I've been batching them into chunks of up to 2000 files so I can incrementally load a batch of files on demand. (I'd like to load ALL of them, but for various reasons beyond this discussion I'm only loading a single 'batch' at a time.) I have the files stored in Amazon S3. To make life easier (and ideally more efficient) I experimented with several techniques.
Ultimately I ended up with two techniques that are very similar but have amazingly different performance. Due to the architecture of the app, I want to be able to 'pull' these files from an ML app directly. When a request comes in to load the files, the ML app fetches them from Amazon (via a URL), unpacks them, and does a document-insert.

1) Zip of many XML files. Zip the XML files (up to 2000) into a single zip file. In ML, unzip the file and extract the documents one by one (by iterating over the manifest) and load them.

2) Zip of a single wrapped XML document. Wrap the XML files into a single big XML document with a root element. Zip that XML file. In ML, unzip the file, then iterate over the children of the root and insert each child as a separate document.

All this is running on Amazon, so the network speed between ML and S3 is quite fast.

I first started with #1 and it worked ... but it would take quite a while. Fetching a 2000-'record' zip and extracting and loading it would take up to a minute. Some performance analysis gave me a clue to try #2.

For starters, I was amazed to find that the zip of 2000 small documents didn't compress much. On reflection it makes sense: zip compresses each entry individually, so each tiny file pays per-entry header overhead and never builds up a useful compression dictionary. I'd rather use a tar/gz format, but ML doesn't have native methods for that (only zip). So that's why I tried #2.

The file size dropped by 10x, and the load time dropped by 10x! Performance traces showed that most of the time was the overhead of extracting individual files from the zip. Extracting 2000 small files took 10x longer than extracting one file (with the same total size of uncompressed data). And amazingly, the overhead of having to parse that one big XML file and then do an XPath on it to pull out all the children was minimal compared to the unzip overhead.

Anyway, just thought I'd share this in case anyone is hitting a similar issue: unzipping a zip of lots of small files is horribly expensive compared to unzipping a single big file. (Rough XQuery sketches of both techniques, and of the segment-and-spawn idea above, follow below.)

----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224
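For concreteness, technique #1 (zip of many files) might look roughly like this in XQuery. This is a sketch only: the signed URL and the /batch-0001/ URI prefix are placeholders, and error handling is omitted.

xquery version "1.0-ml";
declare namespace zip = "xdmp:zip";

(: Placeholder: a pre-signed S3 URL for the batch zip. :)
let $signed-url := "https://my-bucket.s3.amazonaws.com/batch-0001.zip?..."

(: xdmp:http-get returns (response headers, body); [2] is the body. :)
let $zipfile := xdmp:http-get($signed-url,
                  <options xmlns="xdmp:http">
                    <format xmlns="xdmp:document-get">binary</format>
                  </options>)[2]

(: Iterate the zip manifest and insert each entry as its own document. :)
for $part in xdmp:zip-manifest($zipfile)/zip:part
let $name := string($part)
return xdmp:document-insert(concat("/batch-0001/", $name),
                            xdmp:zip-get($zipfile, $name))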
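Technique #2 (single wrapped XML document) differs only in the extraction step. The entry name records.xml and the record URI scheme are hypothetical:

xquery version "1.0-ml";

(: Placeholder signed URL; the fetch is the same as in technique #1. :)
let $signed-url := "https://my-bucket.s3.amazonaws.com/batch-0001.zip?..."
let $zipfile := xdmp:http-get($signed-url,
                  <options xmlns="xdmp:http">
                    <format xmlns="xdmp:document-get">binary</format>
                  </options>)[2]

(: One extraction for the whole batch, then XPath over the children. :)
let $wrapped := xdmp:zip-get($zipfile, "records.xml")
for $record at $i in $wrapped/*/*
return xdmp:document-insert(concat("/batch-0001/record-", $i, ".xml"),
                            document { $record })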
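And Justin's segment-and-spawn suggestion might look something like the following -- a sketch assuming a hypothetical module /load-segment.xqy that declares an external variable uris-csv, fetches each URL in it, and inserts the body. Each xdmp:spawn queues a segment on the task server, so segments load in parallel:

xquery version "1.0-ml";

(: In practice $uris would come from listing the bucket (parsing the
   ListBucketResult XML); hard-coded here for brevity. :)
let $uris := ("https://my-bucket.s3.amazonaws.com/doc-00001.xml",
              "https://my-bucket.s3.amazonaws.com/doc-00002.xml")
let $segment-size := 500
for $i in 1 to xs:integer(ceiling(count($uris) div $segment-size))
let $segment := $uris[position() gt ($i - 1) * $segment-size and
                      position() le $i * $segment-size]
return
  (: Queue one task per segment on the task server. :)
  xdmp:spawn("/load-segment.xqy",
             (xs:QName("uris-csv"), string-join($segment, ",")))

Whether per-object GET latency to S3 makes this competitive with the batch-zip approach is exactly the open question above.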
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
