Hi Costin,

We're very interested in offline processing as well. To draw a parallel to HBase: you could write a Hadoop job that writes out to a table over the Thrift API. However, if you're going to load many terabytes of data, there's the option of writing directly to HBase's HFile format and bulk loading the files into your cluster once generated. Bulk loading was orders of magnitude faster than the HTTP-based API.
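For anyone unfamiliar with the HBase flow, it's roughly a two-step process - just a sketch, with placeholder jar, path, and table names (the job jar and its class are hypothetical; only the LoadIncrementalHFiles tool is a real HBase class):

```shell
# Step 1: run a MapReduce job that emits HFiles instead of issuing puts.
# my-indexer.jar / com.example.BulkIndexJob are placeholders; such a job
# would typically configure HFileOutputFormat for its output.
hadoop jar my-indexer.jar com.example.BulkIndexJob \
  /input/data /tmp/hfiles my_table

# Step 2: hand the generated HFiles over to the region servers
# (the "completebulkload" step).
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  /tmp/hfiles my_table
```

The key property is that step 2 adopts pre-built files rather than re-indexing row by row - which is exactly the pattern I'd love to see for ES below.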
Writing out Lucene segments from a Hadoop job is nothing new (check out the Katta project: http://katta.sourceforge.net/). I saw that ES has snapshot backup/restore in the pipeline. It'd be fantastic if we could write Hadoop jobs that output data in the same format as ES backups, and then use the restore functionality in ES to bulk load the data directly without having to go through a REST API. I feel that would be faster, and it would provide the flexibility to scale out the Hadoop cluster independently of the ES cluster.

On Saturday, June 22, 2013 10:18:57 AM UTC-4, Costin Leau wrote:
>
> I'm not sure what you mean by "offline in Hadoop"...
> Indexing the data requires ES, or you could try to replicate it manually, but I would argue you'll end up duplicating the work done in ES.
> You could potentially set up a smaller (even one-node) ES cluster just for indexing, in parallel or collocated with your Hadoop cluster - you could use this to do the indexing and then copy the indexes over to the live cluster.
> That is, you'll have two ES clusters: one for staging/indexing and another one for live/read-only data...
>
> On 21/06/2013 9:27 PM, Jack Liu wrote:
> > Thanks Costin,
> >
> > I am afraid that I am not allowed to use it (or any API) because of the cluster policy. What I am looking for is to complete the indexing part entirely offline in Hadoop - is that feasible?
> >
> > On Friday, June 21, 2013 10:47:25 AM UTC-7, Costin Leau wrote:
> >
> > Have you looked at Elasticsearch-Hadoop [1]? You can use it to stream data to/from ES to/from Hadoop.
> >
> > [1] https://github.com/elasticsearch/elasticsearch-hadoop/
> >
> > On 21/06/2013 8:38 PM, Jack Liu wrote:
> > > Hi all,
> > >
> > > I am new to ES, and we have a large set of data that needs to be indexed into the ES cluster daily (there is no delta available; we only have 7~8 nodes).
> > > I know that using a mapper function to call the client API directly should be fine; however, our Hadoop cluster policy does not allow that.
> > > So I am wondering if there is a way to just generate the ES index in Hadoop, then copy it into the cluster so that ES could pick it up when reloading.
> > > Or could anyone point me to the right place in the source code related to this?
> > >
> > > Any suggestion would be very helpful!
> > >
> > > Many thanks,
> > > Jack
> >
> > --
> > Costin
>
> --
> Costin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2a32eb96-5c30-491a-a501-0a6950d1918f%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
