Use the scan/scroll API with different queries (filtering by document type, etc.) from a custom tool written in Java. This will be the fastest.
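For illustration, here is a minimal sketch of that approach against the 1.x REST API (shown in Python for brevity rather than Java; the host, index name, and document type below are assumptions, and a real tool would follow the same scan-then-scroll loop):

```python
# Sketch: parallel dump via scan/scroll, one worker per document type.
# Assumes Elasticsearch 1.x semantics: search_type=scan returns only a
# scroll id on the first response, and the scroll body is the bare id.
import json
import urllib.request

ES_HOST = "http://localhost:9200"  # assumption: a local cluster

def http_post(url, body):
    """POST a body to Elasticsearch and decode the JSON response."""
    req = urllib.request.Request(
        url, body.encode("utf-8"), {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def start_scan(index, doc_type, size=500, scroll="5m", fetch=http_post):
    """Open a scan-type search restricted to one document type.

    Returns the initial scroll id; with search_type=scan the first
    response carries no hits, only the scroll id."""
    url = ("%s/%s/%s/_search?search_type=scan&scroll=%s"
           % (ES_HOST, index, doc_type, scroll))
    body = json.dumps({"query": {"match_all": {}}, "size": size})
    return fetch(url, body)["_scroll_id"]

def scroll_hits(scroll_id, scroll="5m", fetch=http_post):
    """Yield pages of hits until the scroll is exhausted."""
    while True:
        url = "%s/_search/scroll?scroll=%s" % (ES_HOST, scroll)
        resp = fetch(url, scroll_id)  # body is the bare scroll id in 1.x
        hits = resp["hits"]["hits"]
        if not hits:
            break
        yield hits
        scroll_id = resp["_scroll_id"]
```

Each worker calls start_scan with its own type (or any other disjoint filter), then drains scroll_hits and writes out its slice in whatever format you need; N disjoint filters give you N independent scans that can run on N machines.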
--
Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Tue, Feb 10, 2015 at 7:41 PM, Andrew McFague <[email protected]> wrote:

> Forgot to mention -- the data set size is around 1.6 billion documents.
>
> On Tuesday, February 10, 2015 at 9:29:39 AM UTC-8, Andrew McFague wrote:
>>
>> I have a use case where I'd like to be able to dump *all* the documents
>> in ES to a specific output format. However, using scan or any other
>> "consistent" view is relatively slow. Using a scan query with a
>> "match_all", it processes items at a rate of around 80,000 per second,
>> but at that rate it will still take over 5 hours to dump. It also can't
>> be parallelized across machines, which effectively caps scaling.
>>
>> I've also looked at tools like Knapsack, Elasticdump, etc., but these
>> still don't give me the ability to parallelize the work, and they're not
>> particularly fast. They also don't allow me to transform the documents
>> into the specific format I want (it's not JSON, and requires some
>> reorganization of the data).
>>
>> So I have a few ideas, which may or may not be possible:
>>
>> 1. Retrieve shard-specific data from Elasticsearch (i.e., "give me all
>> the data for shard X"). This would allow me to divide the task into
>> /at least/ S tasks, where S is the number of shards, but there doesn't
>> seem to be an API that exposes this.
>> 2. Take snapshots of each shard from disk. This would also allow me to
>> divide up the work, but it would require a framework on top to
>> coordinate which shards have been retrieved, etc.
>> 3. Hadoop. However, launching an entire MapReduce cluster just to dump
>> data sounds like overkill.
>>
>> The first option gives me the most flexibility and would require the
>> least amount of work on my part, but there doesn't seem to be any way to
>> dump all the data for a specific shard via the API. Is there any sort of
>> API or flag that provides this, or otherwise provides a way to partition
>> the data across different consumers?
>>
>> The second would also (presumably) give me the ability to subdivide
>> tasks per worker, and would allow these to be done offline. I was able
>> to write a sample program that uses Lucene to do this, but it adds the
>> complexity of coordinating work across the various hosts in the cluster,
>> as well as requiring an intermediate step where I transfer the common
>> files to another host to combine them. This isn't a terrible problem to
>> have, but it does require additional infrastructure to organize.
>>
>> The third is not desirable because it's an incredible amount of
>> operational load without a clear tradeoff, since we don't already have a
>> MapReduce cluster on hand.
>>
>> Thanks for any tips or suggestions!
>>
>> Andrew

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAHTr4Zv9-%3DEsiY1DpzjT8SzQ8jSg7rYrH04UPqYHpwOq2nyMOw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
