Re: How to generate ES index in the hadoop

Costin Leau Wed, 26 Feb 2014 13:58:37 -0800

On 2/26/2014 11:26 PM, drew dahlke wrote:

Hi Costin,


We're very interested offline processing as well. To draw a parallel to HBase, 
you could write a hadoop job that writes
out to a table over the thrift API. However if you're going to load in many 
terrabytes of data, there's the option to
write out directly to the HTable file format and bulk load the file into your 
cluster once generated. Bulk loading was
orders of magnitude faster than the HTTP based API.

Writing out lucene segments from a hadoop job is nothing new (check out the 
katta project
http://katta.sourceforge.net/). I saw that ES has snapshot backup/restore in 
the pipeline. It'd be fantastic if we could
write hadoop jobs that output data in the same format as ES backups and then 
use the restore functionality in ES to bulk
load the data directly without having to go through a REST API. I feel like 
that would be faster and it would provide
the flexibility to scale out the hadoop cluster independently of the ES cluster.

If you are concerned about indexing data directly into a live cluster, you could just have a different, staging onesetup (with a complete topology as well) to which you can index data. Then do a snapshot (potentially to HDFS) and thenload the data into your live one.

This is already supported - see this blog post [1].

Note that ES uses Lucene internally but the segments are just some part of its 
internal metadata.

Recreating an ES index directly into a job means hacking and reimplementing a lot of the distributed work that ES isalready doing for you without a lot (if any) performance gains:- each job/task would have its own ES instance. This means a 1:1 mapping between the job tasks and ES nodes which is awaste of resources.- each ES instance would rely on the running task/job machine. This can overload the hardware since you are forced toco-locate the two whether you want it or not.- at the end each ES instance would have to export its data somehow. Since each node only gets some chunk of the data,the indices would have to be aggregated.This implies significant I/O (since you are moving the same data multiple times) and at least twice the amount of diskspace. The network I/O gets significantly amplified when using HDFS for indexing since the disk is not local; with ESyou can chose to use HDFS or not (for best performance I would advise against that).

Consider here the allocation of data within ES shards/nodes (based on the cluster topology + user settings). For themost part, this will be similar to another reindexing.

The current approach of es-hadoop has none of these issues and all the benefits. You can scale Hadoop or ES independentof each other - your job can have 10s (or 100s in some cases) of tasks that are streaming data to an ES cluster of 5-10beefy nodes. You can start with ES co-located on the same physical machines as Hadoop and, as you grow move some or allthe nodes to a different setup.Since es-hadoop parallelizes _both_ reads and writes, the hadoop job gets full access to the ES cluster; the bigger thetarget index is, the more shards it can talk to in parallel.


Additionally, there's minimal I/O - we only move the data needed _once_ to ES.

If you have a performance problem caused by es-hadoop, I'd be happy to look at 
the numbers/stats.

Hope this helps,

[1] http://www.elasticsearch.org/blog/elasticsearch-hadoop-1-3-m2/


On Saturday, June 22, 2013 10:18:57 AM UTC-4, Costin Leau wrote:

    I'm not sure what you mean by "offline in Hadoop"...
    Indexing the data requires ES or you could try and replicate it manually 
but I would argue you'll end up duplicating
    the
    work done in ES.
    You could potentially setup a smaller (even one node) ES cluster just for 
indexing in parallel or collocated with your
    Hadoop cluster - you could use this to do the indexing and then copy the 
indexes over to the live cluster.
    That is, you'll have two ES clusters: one for staging/indexing and another 
one for live/read-only data...


    On 21/06/2013 9:27 PM, Jack Liu wrote:
    > Thanks Costin,
    >
    > I am afraid that I am not allowed to use it ( or any API), because of the 
cluster policy. What I am looking for is to
    > complete the indexing part entirely offline
    > in the hadoop, is it feasible though?
    >
    >
    >
    > On Friday, June 21, 2013 10:47:25 AM UTC-7, Costin Leau wrote:
    >
    >     Have you looked at Elasticsearch-Hadoop [1] ? You can use it to 
stream data to/from ES to/from Hadoop.
    >
    >     [1]https://github.com/elasticsearch/elasticsearch-hadoop/ 
<https://github.com/elasticsearch/elasticsearch-hadoop/>
    <https://github.com/elasticsearch/elasticsearch-hadoop/ 
<https://github.com/elasticsearch/elasticsearch-hadoop/>>
    >
    >     On 21/06/2013 8:38 PM, Jack Liu wrote:
    >     > Hi all,
    >     >
    >     > I am new to ES, and we have large set of data need to be indexed 
into ES cluster daily (there is no delta available, we
    >     > only have 7~8 nodes).
    >     > I know use mapper function to directly call client api should be 
fine, however, our hadoop cluster policy does not allow
    >     > that.
    >     > So I am wondering if there is a way to just generate ES index in 
the hadoop, and then copy them into the cluster and ES
    >     > could pick them up when reloading.
    >     > Or could anyone point me to right place in the source code that is 
related to it.
    >     >
    >     > Any suggestion could be very helpful !
    >     >
    >     > Many thanks
    >     > Jack
    >     >
    >     > --
    >     > You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
    >     > To unsubscribe from this group and stop receiving emails from it, 
send an email to
    >     >[email protected] <javascript:>.
    >     > For more options, visithttps://groups.google.com/groups/opt_out 
<http://groups.google.com/groups/opt_out> <https://groups.google.com/groups/opt_out
    <https://groups.google.com/groups/opt_out>>.
    >     >
    >     >
    >
    >     --
    >     Costin
    >
    > --
    > You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
    > To unsubscribe from this group and stop receiving emails from it, send an 
email to
    >[email protected] <javascript:>.
    > For more options, visithttps://groups.google.com/groups/opt_out 
<https://groups.google.com/groups/opt_out>.
    >
    >

    --
    Costin

--
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to
[email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/2a32eb96-5c30-491a-a501-0a6950d1918f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


--
Costin

--
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/530E6353.3030803%40gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Re: How to generate ES index in the hadoop

Reply via email to