Hey Costin,

Thanks for the swift reply. I abandoned EC2 to take that out of the equation and managed to get everything working locally using the latest version of everything (though I realized just now I'm still on Hive 0.9). I'm guessing you're right about some port/connection issue, because I definitely had ES running on that machine.
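For what it's worth, this is the kind of quick connectivity check I ran before blaming es-hadoop (a minimal sketch — the `check_es` helper name is mine, and localhost:9200 is just the ES default, so swap in the EC2 host/port):

```shell
#!/usr/bin/env bash
# Probe an Elasticsearch HTTP endpoint the same way es-hadoop's REST client
# would reach it. Host and port default to ES's out-of-the-box localhost:9200.
check_es() {
    local host="${1:-localhost}" port="${2:-9200}"
    if curl -s --max-time 5 "http://${host}:${port}/" >/dev/null; then
        echo "Elasticsearch is reachable at ${host}:${port}"
    else
        echo "cannot connect to ${host}:${port} (ES down, or a firewall/security group in the way?)" >&2
        return 1
    fi
}

# Example: check_es localhost 9200
```

On EC2 the usual culprit is the security group not allowing inbound 9200, which shows up exactly as the "Connection refused" below.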
I changed hive-log4j.properties and added:

```
# custom logging levels
# log4j.logger.xxx=DEBUG
log4j.logger.org.elasticsearch.hadoop.rest=TRACE
log4j.logger.org.elasticsearch.hadoop.mr=TRACE
```

but I didn't see any trace logging. Hopefully I can get it working on EC2 without issue, but, for future reference, is this the correct way to enable TRACE logging?

Oh, and for reference, I tried running without ES up and got the following exception:

```
2014-02-19 13:46:08,803 ERROR shark.SharkDriver (Logging.scala:logError(64)) - FAILED: Hive Internal Error: java.lang.IllegalStateException(Cannot discover Elasticsearch version)
java.lang.IllegalStateException: Cannot discover Elasticsearch version
	at org.elasticsearch.hadoop.hive.EsStorageHandler.init(EsStorageHandler.java:101)
	at org.elasticsearch.hadoop.hive.EsStorageHandler.configureOutputJobProperties(EsStorageHandler.java:83)
	at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:706)
	at org.apache.hadoop.hive.ql.plan.PlanUtils.configureOutputJobPropertiesForStorageHandler(PlanUtils.java:675)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.augmentPlan(FileSinkOperator.java:764)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.putOpInsertMap(SemanticAnalyzer.java:1518)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:4337)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:6207)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:6138)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6764)
	at shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:149)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:244)
	at shark.SharkDriver.compile(SharkDriver.scala:215)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:895)
	at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:324)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
	at shark.SharkCliDriver$.main(SharkCliDriver.scala:232)
	at shark.SharkCliDriver.main(SharkCliDriver.scala)
Caused by: java.io.IOException: Out of nodes and retries; caught exception
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:81)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:221)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:205)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:209)
	at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:103)
	at org.elasticsearch.hadoop.rest.RestClient.esVersion(RestClient.java:274)
	at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:84)
	at org.elasticsearch.hadoop.hive.EsStorageHandler.init(EsStorageHandler.java:99)
	... 18 more
Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
	at java.net.Socket.connect(Socket.java:579)
	at java.net.Socket.connect(Socket.java:528)
	at java.net.Socket.<init>(Socket.java:425)
	at java.net.Socket.<init>(Socket.java:280)
	at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
	at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
	at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
	at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
	at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
	at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport.execute(CommonsHttpTransport.java:160)
	at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:74)
	... 25 more
```

Let me know if there's anything in particular you'd like me to try on EC2.

(For posterity, the versions I used were: Hadoop 2.2.0, Hive 0.9.0, Shark 0.8.1, Spark 0.8.1, es-hadoop 1.3.0.M2, Java 1.7.0_15, Scala 2.9.3, Elasticsearch 1.0.0.)

Thanks again,
Max

On Tuesday, February 18, 2014 10:16:38 PM UTC-8, Costin Leau wrote:
> The error indicates a network error - namely es-hadoop cannot connect to Elasticsearch on the default (localhost:9200) HTTP port. Can you double check whether that's indeed the case (using curl or even telnet on that port) - maybe the firewall prevents any connections from being made...
> Also, you could try using the latest Hive, 0.12, and a more recent Hadoop such as 1.1.2 or 1.2.1.
>
> Additionally, can you enable TRACE logging in your job for the es-hadoop packages org.elasticsearch.hadoop.rest and org.elasticsearch.hadoop.mr and report back?
>
> Thanks,
>
> On 19/02/2014 4:03 AM, Max Lang wrote:
>> I set everything up using this guide: https://github.com/amplab/shark/wiki/Running-Shark-on-EC2 on an EC2 cluster. I've copied the elasticsearch-hadoop jars into the Hive lib directory and I have Elasticsearch running on localhost:9200. I'm running Shark in a screen session with --service screenserver and connecting to it at the same time using shark -h localhost.
>>
>> Unfortunately, when I attempt to write data into Elasticsearch, it fails.
>> Here's an example:
>>
>> [localhost:10000] shark> CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
>> Time taken (including network latency): 0.159 seconds
>> 14/02/19 01:23:33 INFO CliDriver: Time taken (including network latency): 0.159 seconds
>>
>> [localhost:10000] shark> SELECT title FROM wiki LIMIT 1;
>> Alpokalja
>> Time taken (including network latency): 2.23 seconds
>> 14/02/19 01:23:48 INFO CliDriver: Time taken (including network latency): 2.23 seconds
>>
>> [localhost:10000] shark> CREATE EXTERNAL TABLE es_wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'wikipedia/article');
>> Time taken (including network latency): 0.061 seconds
>> 14/02/19 01:33:51 INFO CliDriver: Time taken (including network latency): 0.061 seconds
>>
>> [localhost:10000] shark> INSERT OVERWRITE TABLE es_wiki SELECT w.id, w.title, w.last_modified, w.xml, w.text FROM wiki w;
>> [Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code -101 from shark.execution.SparkTask
>> Time taken (including network latency): 3.575 seconds
>> 14/02/19 01:34:42 INFO CliDriver: Time taken (including network latency): 3.575 seconds
>>
>> The stack trace looks like this:
>>
>> org.apache.hadoop.hive.ql.metadata.HiveException (org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Out of nodes and retries; caught exception)
>> org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
>> shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
>> shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
>> scala.collection.Iterator$class.foreach(Iterator.scala:772)
>> scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
>> shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:81)
>> shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:207)
>> shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
>> shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
>> org.apache.spark.scheduler.Task.run(Task.scala:53)
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:215)
>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:744)
>>
>> I should be using Hive 0.9.0, Shark 0.8.1, Elasticsearch 1.0.0, Hadoop 1.0.4, and Java 1.7.0_51.
>>
>> Based on my cursory look at the Hadoop and elasticsearch-hadoop sources, it looks like Hive is just rethrowing an IOException it's getting from Spark, and elasticsearch-hadoop is just hitting those exceptions.
>>
>> I suppose my questions are: Does this look like an issue with my ES/elasticsearch-hadoop config? And has anyone gotten Elasticsearch working with Spark/Shark?
>>
>> Any ideas/insights are appreciated.
>>
>> Thanks,
>> Max
>>
>> --
>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9486faff-3eaf-4344-8931-3121bbc5d9c7%40googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
>
> --
> Costin
