Hey Costin,

Spinning up a dedicated cluster for the bulk loading seems like the right 
thing for us, but that's a big moving piece.

I think there might be a middle ground. I did a Hadoop experiment where I 
spun up an embedded ES client inside a reducer, with the intention of having 
each reducer own a single index. We partition data by day, so 365 indices = 
365 reducers = 365 single-node ES clusters. I'd probably have each client 
work independently with no replication. Then all I have to do is have the 
client send the generated index to a snapshot repo (perhaps S3) within the 
close method of the reducer. Here's a totally hacky snippet to give you an 
idea: http://pastebin.com/kH3Xsa5e. After that, it's just a snapshot restore 
request on the live cluster to load the data in from the snapshot repo. 
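
To make the shape of it concrete, here's a trimmed-down sketch of the idea 
(not the actual pastebin code; the index/bucket names and the "partition.day" 
job property are placeholders, and it assumes the cloud-aws plugin, with 
credentials configured, for the "s3" repository type):

// Rough illustration only - one reducer owns one daily index via an embedded
// ES node, then snapshots it to an S3 repo from close().
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class DailyIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, NullWritable, NullWritable> {

    private Node node;
    private Client client;
    private String index;

    @Override
    public void configure(JobConf job) {
        // one reducer == one day == one index == one single-node "cluster"
        index = "logs-" + job.get("partition.day");   // placeholder property
        node = nodeBuilder()
                .clusterName("bulk-" + index)          // keep each node isolated
                .settings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", 0)
                        .put("discovery.zen.ping.multicast.enabled", false))
                .node();
        client = node.client();
    }

    @Override
    public void reduce(Text key, Iterator<Text> docs,
                       OutputCollector<NullWritable, NullWritable> out,
                       Reporter reporter) throws IOException {
        // the real thing would batch through the bulk API
        while (docs.hasNext()) {
            client.prepareIndex(index, "event")
                  .setSource(docs.next().toString())
                  .execute().actionGet();
        }
    }

    @Override
    public void close() throws IOException {
        // ship the finished index to the snapshot repo from close()
        client.admin().cluster().preparePutRepository("s3_repo")
              .setType("s3")
              .setSettings(ImmutableSettings.settingsBuilder()
                      .put("bucket", "my-bulk-bucket").build())
              .execute().actionGet();
        client.admin().cluster().prepareCreateSnapshot("s3_repo", index)
              .setIndices(index)
              .setWaitForCompletion(true)
              .execute().actionGet();
        node.close();
        // the live cluster then just needs:
        //   POST /_snapshot/s3_repo/<snapshot_name>/_restore
    }
}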

There's plenty I could do to improve that, and it'll certainly require beefy 
task trackers, but that's livable. The part that makes me unhappy is that 
it's using the task trackers' local storage rather than HDFS for writing 
the data. The performance will probably be better on local storage, but 
this approach would limit the size of an index to:

(the size of the disk on a task tracker) / (num reducer slots per node) / 
(magic number). 
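
For example (purely hypothetical numbers): a 2 TB task tracker disk shared by 
8 reducer slots, with a 2x fudge factor for segment merges and the snapshot 
copy, caps each daily index at roughly 125 GB.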

Thus my question: you sort of alluded to the idea that one could use HDFS for 
writing Elasticsearch data. I believe Lucene can write directly to HDFS, 
because Solr can use HDFS as its backing store. However, I can't find any 
literature on how to back an embedded ES instance with HDFS. Is that 
possible? Where should I look?

Thanks!
Drew

On Wednesday, February 26, 2014 4:57:39 PM UTC-5, Costin Leau wrote:
>
> On 2/26/2014 11:26 PM, drew dahlke wrote: 
> > Hi Costin, 
> > 
> > We're very interested in offline processing as well. To draw a parallel 
> > to HBase, you could write a Hadoop job that writes out to a table over 
> > the Thrift API. However, if you're going to load in many terabytes of 
> > data, there's the option to write out directly to the HTable file format 
> > and bulk load the file into your cluster once generated. Bulk loading 
> > was orders of magnitude faster than the HTTP-based API. 
> > 
> > Writing out Lucene segments from a Hadoop job is nothing new (check out 
> > the Katta project, http://katta.sourceforge.net/). I saw that ES has 
> > snapshot backup/restore in the pipeline. It'd be fantastic if we could 
> > write Hadoop jobs that output data in the same format as ES backups and 
> > then use the restore functionality in ES to bulk load the data directly, 
> > without having to go through a REST API. I feel like that would be 
> > faster, and it would provide the flexibility to scale out the Hadoop 
> > cluster independently of the ES cluster. 
> > 
>
> If you are concerned about indexing data directly into a live cluster, you 
> could just have a separate staging cluster set up (with a complete 
> topology as well) to which you index the data. Then do a snapshot 
> (potentially to HDFS) and load the data into your live one. 
> This is already supported - see this blog post [1]. 
>
> Note that ES uses Lucene internally, but the segments are just one part of 
> its internal metadata. 
>
> Recreating an ES index directly inside a job means hacking and 
> reimplementing a lot of the distributed work that ES is already doing for 
> you, without much (if any) performance gain: 
> - each job/task would have its own ES instance. This means a 1:1 mapping 
> between the job tasks and ES nodes, which is a waste of resources. 
> - each ES instance would rely on the machine running the task/job. This 
> can overload the hardware since you are forced to co-locate the two 
> whether you want it or not. 
> - at the end, each ES instance would have to export its data somehow. 
> Since each node only gets some chunk of the data, the indices would have 
> to be aggregated. 
> This implies significant I/O (since you are moving the same data multiple 
> times) and at least twice the amount of disk space. The network I/O gets 
> significantly amplified when using HDFS for indexing since the disk is not 
> local; with ES you can choose to use HDFS or not (for best performance I 
> would advise against it). 
>
> Consider also the allocation of data within ES shards/nodes (based on the 
> cluster topology + user settings); for the most part, this will be similar 
> to another reindexing. 
>
> The current approach of es-hadoop has none of these issues and all of the 
> benefits. You can scale Hadoop and ES independently of each other - your 
> job can have 10s (or 100s in some cases) of tasks streaming data to an ES 
> cluster of 5-10 beefy nodes. You can start with ES co-located on the same 
> physical machines as Hadoop and, as you grow, move some or all of the 
> nodes to a different setup. 
> Since es-hadoop parallelizes _both_ reads and writes, the Hadoop job gets 
> full access to the ES cluster; the bigger the target index is, the more 
> shards it can talk to in parallel. 
>
> Additionally, there's minimal I/O - we move the needed data to ES only 
> _once_. 
>
> If you have a performance problem caused by es-hadoop, I'd be happy to 
> look at the numbers/stats. 
>
> Hope this helps, 
>
> [1] http://www.elasticsearch.org/blog/elasticsearch-hadoop-1-3-m2/ 
>
> > 
> > On Saturday, June 22, 2013 10:18:57 AM UTC-4, Costin Leau wrote: 
> > 
> >     I'm not sure what you mean by "offline in Hadoop"... 
> >     Indexing the data requires ES, or you could try to replicate it 
> >     manually, but I would argue you'll end up duplicating the work done 
> >     in ES. 
> >     You could potentially set up a smaller (even one-node) ES cluster 
> >     just for indexing, in parallel or collocated with your Hadoop 
> >     cluster - you could use this to do the indexing and then copy the 
> >     indexes over to the live cluster. 
> >     That is, you'll have two ES clusters: one for staging/indexing and 
> >     another one for live/read-only data... 
> > 
> >     On 21/06/2013 9:27 PM, Jack Liu wrote: 
> >     > Thanks Costin, 
> >     > 
> >     > I am afraid that I am not allowed to use it (or any API) because 
> >     > of the cluster policy. What I am looking for is to complete the 
> >     > indexing part entirely offline in Hadoop - is that feasible? 
> >     > 
> >     > On Friday, June 21, 2013 10:47:25 AM UTC-7, Costin Leau wrote: 
> >     > 
> >     >     Have you looked at Elasticsearch-Hadoop [1]? You can use it 
> >     >     to stream data to/from ES and Hadoop. 
> >     > 
> >     >     [1] https://github.com/elasticsearch/elasticsearch-hadoop/ 
> >     > 
> >     >     On 21/06/2013 8:38 PM, Jack Liu wrote: 
> >     >     > Hi all, 
> >     >     > 
> >     >     > I am new to ES, and we have a large set of data that needs 
> >     >     > to be indexed into the ES cluster daily (there is no delta 
> >     >     > available, and we only have 7~8 nodes). 
> >     >     > I know that using a mapper function to directly call the 
> >     >     > client API should be fine; however, our Hadoop cluster 
> >     >     > policy does not allow that. 
> >     >     > So I am wondering if there is a way to just generate the ES 
> >     >     > index in Hadoop, then copy it into the cluster so that ES 
> >     >     > could pick it up when reloading. 
> >     >     > Or could anyone point me to the right place in the source 
> >     >     > code related to this? 
> >     >     > 
> >     >     > Any suggestion would be very helpful! 
> >     >     > 
> >     >     > Many thanks, 
> >     >     > Jack 
> >     >     > 
> >     > 
> >     >     -- 
> >     >     Costin 
> >     > 
> > 
> >     -- 
> >     Costin 
> > 
>
> -- 
> Costin 
>
