Use the scan/scroll API with several different queries in parallel (filter by
document type, etc.), from a custom tool written in Java. This will be the
fastest approach.
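
Roughly along these lines, with the 1.x Java client -- a minimal sketch,
assuming an already-initialized Client; the index/type names, scroll timeout
and page size are placeholders, and you'd run one instance per type/query to
parallelize:

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;

    public class TypeDumper {
        // Dumps every document of one type; run one of these per type
        // (or per filter query), on separate machines if needed.
        public static void dump(Client client, String index, String type) {
            SearchResponse resp = client.prepareSearch(index)
                    .setTypes(type)
                    .setSearchType(SearchType.SCAN)    // no scoring, no sorting
                    .setScroll(new TimeValue(60000))   // keep the cursor alive 60s
                    .setQuery(QueryBuilders.matchAllQuery())
                    .setSize(1000)                     // hits per shard per round-trip
                    .execute().actionGet();

            while (true) {
                resp = client.prepareSearchScroll(resp.getScrollId())
                        .setScroll(new TimeValue(60000))
                        .execute().actionGet();
                if (resp.getHits().getHits().length == 0) break; // scroll exhausted
                for (SearchHit hit : resp.getHits()) {
                    // transform to your target format here
                    System.out.println(hit.getSourceAsString());
                }
            }
        }
    }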

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Tue, Feb 10, 2015 at 7:41 PM, Andrew McFague <[email protected]> wrote:

> Forgot to mention--the data set size is around 1.6 billion documents.
>
> On Tuesday, February 10, 2015 at 9:29:39 AM UTC-8, Andrew McFague wrote:
>>
>> I have a use case where I'd like to be able to dump *all* the documents
>> in ES to a specific output format.  However, using scan or any other
>> "consistent" view is relatively slow.  Using the scan query with a
>> "match_all", it processes items at a rate of around 80,000 per second--but
>> at that rate it will still take over 5 hours to dump everything.  A single
>> scan also can't be parallelized across machines, which effectively caps
>> throughput.
>>
>> I've also looked at tools like Knapsack, elasticdump, etc., but these
>> still don't give me a way to parallelize the work, and they're not
>> particularly fast.  They also don't let me transform the output into the
>> specific format I want (it isn't JSON, and requires some reorganization of
>> the data).
>>
>> So I have a few ideas, which may or may not be possible:
>>
>>    1. Retrieve shard-specific data from ElasticSearch (i.e., "Give me
>>    all the data for Shard X").  This would allow me to divide the task up
>>    into /at least/ S tasks, where S is the number of shards, but there
>>    doesn't seem to be an API that exposes this.
>>    2. Get snapshots of each shard from disk.  This would also allow me
>>    to divide up the work, but would also require a framework on top to
>>    coordinate which shards have been retrieved, etc.
>>    3. Hadoop.  However, launching an entire MR cluster just to dump data
>>    sounds like overkill.
>>
>> The first option gives me the most flexibility and would require the
>> least work on my part, but there doesn't seem to be any way to dump all
>> the data for a specific shard via the API.  Is there any sort of API or
>> flag that provides this, or otherwise a way to partition the data across
>> different consumers?
>>
>> The second would also (presumably) let me subdivide tasks per worker,
>> and would allow the work to be done offline.  I was able to write a sample
>> program that uses Lucene to do this, but it adds the complexity of
>> coordinating work across the various hosts in the cluster, as well as an
>> intermediate step where I transfer the resulting files to another host to
>> combine them.  This isn't a terrible problem to have--but it does require
>> additional infrastructure to organize.
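>>
>> Roughly, that program can be sketched like this (simplified; the shard
>> path layout and the reliance on the stored _source field are assumptions
>> about my setup):
>>
>>     import java.io.File;
>>
>>     import org.apache.lucene.document.Document;
>>     import org.apache.lucene.index.DirectoryReader;
>>     import org.apache.lucene.index.MultiFields;
>>     import org.apache.lucene.store.FSDirectory;
>>     import org.apache.lucene.util.Bits;
>>
>>     public class ShardDumper {
>>         // args[0]: path to one shard's Lucene index, e.g.
>>         // <data>/<cluster>/nodes/0/indices/<index>/<shard>/index
>>         public static void main(String[] args) throws Exception {
>>             DirectoryReader reader =
>>                     DirectoryReader.open(FSDirectory.open(new File(args[0])));
>>             try {
>>                 Bits liveDocs = MultiFields.getLiveDocs(reader);
>>                 for (int i = 0; i < reader.maxDoc(); i++) {
>>                     if (liveDocs != null && !liveDocs.get(i)) continue; // deleted
>>                     Document doc = reader.document(i);
>>                     // _source holds the original JSON (assuming it's enabled)
>>                     System.out.println(
>>                             doc.getBinaryValue("_source").utf8ToString());
>>                 }
>>             } finally {
>>                 reader.close();
>>             }
>>         }
>>     }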
>>
>> The third is not desirable because it adds a huge amount of operational
>> load without a clear payoff, since we don't already have a MapReduce
>> cluster on hand.
>>
>> Thanks for any tips or suggestions!
>>
>> Andrew
>>
