I have a use case where I'd like to be able to dump *all* the documents in 
ES to a specific output format.  However, using scan or any other 
"consistent" view is relatively slow.  Using the scan query with a 
"match_all", it processes documents at a rate of around 80,000 per second, 
which means it would still take over 5 hours to dump everything.  It also 
means the work can't be parallelized across machines, which effectively 
prevents scaling out.
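
For concreteness, the scan/scroll loop I'm describing is roughly the 
following (a minimal sketch using the Java client; the index name, scroll 
timeout, and batch size are placeholders):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class ScanDump {
    // Walks every document in the index with a single scan/scroll cursor.
    // One cursor over the whole index, so throughput is capped at one consumer.
    public static void dumpAll(Client client) {
        SearchResponse resp = client.prepareSearch("myindex")
                .setSearchType(SearchType.SCAN)
                .setScroll(new TimeValue(60000))
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(1000)                       // hits per shard per scroll round
                .execute().actionGet();

        while (true) {
            resp = client.prepareSearchScroll(resp.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
            if (resp.getHits().getHits().length == 0) {
                break;                               // scroll exhausted
            }
            for (SearchHit hit : resp.getHits()) {
                // convert hit.getSourceAsString() into the target output format here
            }
        }
    }
}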

I've also looked at tools like Knapsack, Elastidump, etc., but these still 
don't give me the ability to parallelize the work, and they're not 
particularly fast.  They also don't let me transform the output into the 
specific format I want (it's not JSON, and it requires some reorganization 
of the data).

So I have a few ideas, which may or may not be possible:

   1. Retrieve shard-specific data from ElasticSearch (i.e., "Give me all 
   the data for Shard X").  This would allow me to divide the task up into /at 
   least/ S tasks, where S is the number of shards, but there doesn't seem 
   to be an API that exposes this.
   2. Get snapshots of each shard from disk.  This would also allow me to 
   divide up the work, but it would require a framework on top to coordinate 
   which segments have been retrieved, etc.
   3. Hadoop.  However, launching an entire MR cluster just to dump data 
   sounds like overkill.

The first option gives me the most flexibility and would require the least 
amount of work on my part, but there doesn't seem to be any way to dump all 
the data for a specific shard via the API.  Is there any sort of API or 
flag that exposes this, or otherwise provides a way to partition the data 
across different consumers?
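
To make the question concrete: the per-shard worker I'm hoping for would 
look roughly like the sketch below.  The "_shards" preference selector is 
my guess at how such partitioning might be exposed; I haven't verified that 
it works with scan/scroll or that it yields a complete, non-overlapping 
partition, and the index name and shard id are placeholders.

// Hypothetical worker: each of N consumers scans only its own shard,
// so the dump can run on N machines in parallel.
public static void dumpShard(Client client, int shardId) {
    SearchResponse resp = client.prepareSearch("myindex")
            .setSearchType(SearchType.SCAN)
            .setScroll(new TimeValue(60000))
            .setQuery(QueryBuilders.matchAllQuery())
            .setPreference("_shards:" + shardId)   // unverified: restrict the scan to one shard
            .setSize(1000)
            .execute().actionGet();
    // ...same scroll loop as above, writing this shard's documents out...
}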

The second would also (presumably) give me the ability to split the work up 
per worker, and would allow it to be done offline.  I was able to write a 
sample program that uses Lucene to do this, but it adds the additional 
complexity of coordinating work across the various hosts in the cluster, as 
well as requiring an intermediate step where I transfer the per-shard files 
to another host to combine them.  This isn't a terrible problem to have, 
but it does require additional infrastructure to organize.
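
Roughly, the Lucene-based sample looks like this (a simplified sketch 
against Lucene 4; the shard path is a placeholder and error handling is 
omitted):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

public class ShardDump {
    // Opens one shard's Lucene index directly from disk and emits the stored
    // _source of every live (non-deleted) document.
    public static void main(String[] args) throws Exception {
        // Placeholder path, e.g. <data.dir>/<cluster>/nodes/0/indices/<index>/<shard>/index
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
        Bits liveDocs = MultiFields.getLiveDocs(reader);
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue;                            // skip deleted documents
            }
            Document doc = reader.document(i);
            BytesRef source = doc.getBinaryValue("_source");
            if (source != null) {
                // source.bytes[source.offset .. source.offset + source.length) holds the
                // original JSON; reorganize it into the target format here.
            }
        }
        reader.close();
    }
}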

The third is not desirable because it adds a large amount of operational 
load without a clear payoff, since we don't already have a MapReduce 
cluster on hand.

Thanks for any tips or suggestions!

Andrew
