Forgot to mention: the data set is around 1.6 billion documents.

On Tuesday, February 10, 2015 at 9:29:39 AM UTC-8, Andrew McFague wrote:
>
> I have a use case where I'd like to be able to dump *all* the documents
> in ES to a specific output format. However, using scan or any other
> "consistent" view is relatively slow. Using the scan query with a
> "match_all", it processes items at a rate of around 80,000 a second,
> which means it will still take over 5 hours to dump. It also means the
> work can't be parallelized across machines, which effectively stops it
> from scaling.
>
> I've also looked at tools like Knapsack, elasticdump, etc., but these
> still don't give me the ability to parallelize the work, and they're not
> particularly fast. They also don't let me transform the output into the
> specific format I want (it's not JSON, and requires some reorganization
> of the data).
>
> So I have a few ideas, which may or may not be possible:
>
>    1. Retrieve shard-specific data from Elasticsearch (i.e., "give me
>    all the data for shard X"). This would let me divide the task into
>    *at least* S tasks, where S is the number of shards, but there
>    doesn't seem to be an API that exposes this.
>    2. Take snapshots of each shard from disk. This would also let me
>    divide up the work, but would require a framework on top to
>    coordinate which segments have been retrieved, etc.
>    3. Hadoop. However, launching an entire MR cluster just to dump data
>    sounds like overkill.
>
> The first option gives me the most flexibility and would require the
> least work on my part, but there doesn't seem to be any way to dump all
> the data for a specific shard via the API. Is there any sort of API or
> flag that provides this, or that otherwise provides a way to partition
> the data among different consumers?
>
> The second would also (presumably) let me subdivide tasks per worker,
> and the work could be done offline. I was able to write a sample program
> that uses Lucene to do this, but it adds the complexity of coordinating
> work across the various hosts in the cluster, as well as an intermediate
> step where I transfer the files to another host to combine them. This
> isn't a terrible problem to have, but it does require additional
> infrastructure to organize.
>
> The third is undesirable because it's an enormous amount of operational
> overhead without a clear payoff, since we don't already have a MapReduce
> cluster on hand.
>
> Thanks for any tips or suggestions!
>
> Andrew
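For idea 1, one thing that may already get you most of the way there: the
search API's "preference" parameter accepts a "_shards:n" value that
restricts a request to a single shard, and it can be combined with
scan/scroll so each worker drains one shard independently. A minimal
sketch with the 1.x Java transport client follows; the host, index name,
batch size, and scroll timeout are placeholders to tune for your cluster.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class ShardScanDump {
    public static void main(String[] args) {
        int shard = Integer.parseInt(args[0]); // the shard this worker owns

        TransportClient client = new TransportClient()
            .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Open a scan-mode scroll restricted to a single shard.
        SearchResponse resp = client.prepareSearch("myindex")
            .setSearchType(SearchType.SCAN)
            .setScroll(TimeValue.timeValueMinutes(5))
            .setQuery(QueryBuilders.matchAllQuery())
            .setPreference("_shards:" + shard) // only touch this shard
            .setSize(1000)                     // docs per round trip
            .execute().actionGet();

        // The first scan response carries no hits; keep scrolling until empty.
        while (true) {
            resp = client.prepareSearchScroll(resp.getScrollId())
                .setScroll(TimeValue.timeValueMinutes(5))
                .execute().actionGet();
            if (resp.getHits().getHits().length == 0) {
                break;
            }
            for (SearchHit hit : resp.getHits()) {
                // Replace with your own serialization to the target format.
                System.out.println(hit.getSourceAsString());
            }
        }
        client.close();
    }
}

Running one worker per shard, spread over as many machines as you like,
should let the dump scale roughly with the shard count.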
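For idea 2, since you already have a Lucene prototype: reading a copied
shard directory offline is only a few lines with Lucene 4.x (the series ES
1.x ships with), assuming "_source" is enabled (the default). A rough
sketch; the path argument is whatever you copied out, typically
<data>/<cluster>/nodes/0/indices/<index>/<shard>/index.

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

public class ShardSourceDump {
    public static void main(String[] args) throws Exception {
        // args[0]: path to one shard's Lucene index directory.
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));

        // liveDocs is null when the index has no deletions.
        Bits liveDocs = MultiFields.getLiveDocs(reader);

        for (int i = 0; i < reader.maxDoc(); i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // skip deleted-but-not-yet-merged docs
            }
            Document doc = reader.document(i);
            // ES keeps the original JSON in a binary stored field named "_source".
            BytesRef source = doc.getBinaryValue("_source");
            if (source != null) {
                System.out.println(source.utf8ToString());
            }
        }
        reader.close();
    }
}

This still leaves the coordination problem you mentioned, and the copy has
to come from a consistent point (e.g. after a flush), or you'll miss docs
that are still only in the transaction log.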
