I have a use case where I'd like to be able to dump *all* the documents in ES to a specific output format. However, using scan (or any other "consistent" view) is relatively slow. A scan query with a match_all processes documents at around 80,000 per second, but at that rate the full dump will still take over 5 hours. It also can't be parallelized across machines, which effectively caps how far it can scale.
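For concreteness, here's roughly what the current dump loop looks like (a minimal sketch against the Java client's scan/scroll API; the index name, batch size, and timeout are placeholders):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

// Serial scan/scroll over the whole index -- a single consumer, no parallelism.
public final class FullDump {
    public static void dump(Client client) {
        SearchResponse resp = client.prepareSearch("myindex")    // placeholder index name
                .setSearchType(SearchType.SCAN)
                .setScroll(new TimeValue(60000))
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(1000)                                   // hits per shard per round-trip
                .execute().actionGet();
        while (true) {
            resp = client.prepareSearchScroll(resp.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
            if (resp.getHits().getHits().length == 0) break;     // scroll is drained
            for (SearchHit hit : resp.getHits()) {
                String json = hit.getSourceAsString();
                // ...convert `json` to the target output format here...
            }
        }
    }
}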
I've also looked at tools like Knapsack, elasticdump, etc., but these still don't give me the ability to parallelize the work, and they're not particularly fast. They also don't let me transform the data into the specific format I want (it's not JSON, and requires some reorganization of the data). So I have a few ideas, which may or may not be possible:

1. Retrieve shard-specific data from Elasticsearch (i.e., "give me all the data for shard X"). This would allow me to divide the task into /at least/ S tasks, where S is the number of shards, but there doesn't seem to be an API that exposes this.

2. Get snapshots of each shard from disk. This would also allow me to divide up the work, but would require a framework on top to coordinate which shards have been retrieved, etc.

3. Hadoop. However, launching an entire MapReduce cluster just to dump data sounds like overkill.

The first option gives me the most flexibility and would require the least amount of work on my part, but there doesn't seem to be any way to dump all the data for a specific shard via the API. Is there any sort of API or flag that provides this, or that otherwise provides a way to partition the data across different consumers? (The first sketch below shows roughly what I'm imagining.)

The second would also (presumably) give me the ability to subdivide tasks per worker, and would allow the dump to be done offline. I was able to write a sample program that uses Lucene to do this (the second sketch below is the gist of it), but it adds the complexity of coordinating work across the various hosts in the cluster, as well as an intermediate step where I transfer the shard files to another host to combine them. That isn't a terrible problem to have, but it does require additional infrastructure to organize.

The third isn't desirable because it's an enormous amount of operational load without a clear payoff, since we don't already have a MapReduce cluster on hand.
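To make option 1 concrete, here's the kind of per-shard worker I'd want to run. I'm sketching it with the search "preference" parameter's _shards:<id> syntax, which I believe exists for ordinary searches; I don't know whether it composes with scan/scroll or yields a complete, non-overlapping partition, so treat this as wishful pseudocode rather than a working answer:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;

// Hypothetical per-shard worker: worker N drains only shard N.
public final class ShardDumpWorker {
    public static String openScan(Client client, int shardId) {
        SearchResponse resp = client.prepareSearch("myindex")      // placeholder index name
                .setSearchType(SearchType.SCAN)
                .setScroll(new TimeValue(60000))
                .setPreference("_shards:" + shardId)               // assumption: restricts the scan to one shard
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(1000)
                .execute().actionGet();
        return resp.getScrollId();                                 // then drain with prepareSearchScroll as above
    }
}

Running one of these per shard, spread across machines, is exactly what would let the dump scale out.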
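And for option 2, the gist of the Lucene sample program I mentioned looks like this (Lucene 4.x API; the shard path is a placeholder, and it assumes _source is enabled as a stored field):

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

// Reads one shard's Lucene index straight off disk, bypassing ES entirely.
public final class ShardReader {
    public static void main(String[] args) throws Exception {
        // Placeholder path, e.g. <data dir>/<cluster>/nodes/0/indices/<index>/<shard>/index
        File shardDir = new File(args[0]);
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(shardDir))) {
            Bits liveDocs = MultiFields.getLiveDocs(reader);         // null when there are no deletions
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (liveDocs != null && !liveDocs.get(i)) continue;  // skip deleted docs
                Document doc = reader.document(i);
                BytesRef source = doc.getBinaryValue("_source");     // assumes _source is stored
                if (source != null) {
                    String json = source.utf8ToString();
                    // ...reformat `json` into the target output here...
                }
            }
        }
    }
}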
Thanks for any tips or suggestions!

Andrew