Forgot to mention: the data set is around 1.6 billion documents.

On Tuesday, February 10, 2015 at 9:29:39 AM UTC-8, Andrew McFague wrote:
>
> I have a use case where I'd like to be able to dump *all* the documents
> in ES to a specific output format. However, using scan or any other
> "consistent" view is relatively slow. Using the scan query with a
> "match_all", it processes items at a rate of around 80,000 a second,
> which means it will still take over 5 hours to dump. It also means the
> work can't be parallelized across machines, which effectively stops it
> from scaling.
>
> I've also looked at tools like Knapsack, elasticdump, etc., but these
> still don't give me the ability to parallelize the work, and they're not
> particularly fast. They also don't let me transform the output into the
> specific format I want (it's not JSON, and requires some reorganization
> of the data).
>
> So I have a few ideas, which may or may not be possible:
>
>    1. Retrieve shard-specific data from Elasticsearch (i.e., "give me
>    all the data for shard X"). This would let me divide the task into
>    *at least* S tasks, where S is the number of shards, but there
>    doesn't seem to be an API that exposes this.
>    2. Take snapshots of each shard from disk. This would also let me
>    divide up the work, but would require a framework on top to
>    coordinate which segments have been retrieved, etc.
>    3. Hadoop. However, launching an entire MR cluster just to dump data
>    sounds like overkill.
>
> The first option gives me the most flexibility and would require the
> least work on my part, but there doesn't seem to be any way to dump all
> the data for a specific shard via the API. Is there any sort of API or
> flag that provides this, or that otherwise provides a way to partition
> the data among different consumers?
>
> The second would also (presumably) let me subdivide tasks per worker,
> and the work could be done offline. I was able to write a sample program
> that uses Lucene to do this, but it adds the complexity of coordinating
> work across the various hosts in the cluster, as well as an intermediate
> step where I transfer the files to another host to combine them. This
> isn't a terrible problem to have, but it does require additional
> infrastructure to organize.
>
> The third is undesirable because it's an enormous amount of operational
> overhead without a clear payoff, since we don't already have a MapReduce
> cluster on hand.
>
> Thanks for any tips or suggestions!
>
> Andrew
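For idea 1, one thing that may already get you most of the way there: the
search API's "preference" parameter accepts a "_shards:n" value that
restricts a request to a single shard, and it can be combined with
scan/scroll so each worker drains one shard independently. A minimal
sketch with the 1.x Java transport client follows; the host, index name,
batch size, and scroll timeout are placeholders to tune for your cluster.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class ShardScanDump {
    public static void main(String[] args) {
        int shard = Integer.parseInt(args[0]); // the shard this worker owns

        TransportClient client = new TransportClient()
            .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Open a scan-mode scroll restricted to a single shard.
        SearchResponse resp = client.prepareSearch("myindex")
            .setSearchType(SearchType.SCAN)
            .setScroll(TimeValue.timeValueMinutes(5))
            .setQuery(QueryBuilders.matchAllQuery())
            .setPreference("_shards:" + shard) // only touch this shard
            .setSize(1000)                     // docs per round trip
            .execute().actionGet();

        // The first scan response carries no hits; keep scrolling until empty.
        while (true) {
            resp = client.prepareSearchScroll(resp.getScrollId())
                .setScroll(TimeValue.timeValueMinutes(5))
                .execute().actionGet();
            if (resp.getHits().getHits().length == 0) {
                break;
            }
            for (SearchHit hit : resp.getHits()) {
                // Replace with your own serialization to the target format.
                System.out.println(hit.getSourceAsString());
            }
        }
        client.close();
    }
}

Running one worker per shard, spread over as many machines as you like,
should let the dump scale roughly with the shard count.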
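For idea 2, since you already have a Lucene prototype: reading a copied
shard directory offline is only a few lines with Lucene 4.x (the series ES
1.x ships with), assuming "_source" is enabled (the default). A rough
sketch; the path argument is whatever you copied out, typically
<data>/<cluster>/nodes/0/indices/<index>/<shard>/index.

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

public class ShardSourceDump {
    public static void main(String[] args) throws Exception {
        // args[0]: path to one shard's Lucene index directory.
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));

        // liveDocs is null when the index has no deletions.
        Bits liveDocs = MultiFields.getLiveDocs(reader);

        for (int i = 0; i < reader.maxDoc(); i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // skip deleted-but-not-yet-merged docs
            }
            Document doc = reader.document(i);
            // ES keeps the original JSON in a binary stored field named "_source".
            BytesRef source = doc.getBinaryValue("_source");
            if (source != null) {
                System.out.println(source.utf8ToString());
            }
        }
        reader.close();
    }
}

This still leaves the coordination problem you mentioned, and the copy has
to come from a consistent point (e.g. after a flush), or you'll miss docs
that are still only in the transaction log.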
