I set everything up on an EC2 cluster using this guide:
https://github.com/amplab/shark/wiki/Running-Shark-on-EC2
I've copied the elasticsearch-hadoop jars into the Hive lib directory, and
Elasticsearch is running on localhost:9200. I'm running Shark in a screen
session with --service sharkserver and connecting to it at the same time
using shark -h localhost.
Unfortunately, when I attempt to write data into elasticsearch, it fails.
Here's an example:
[localhost:10000] shark> CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING,
last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
Time taken (including network latency): 0.159 seconds
14/02/19 01:23:33 INFO CliDriver: Time taken (including network latency):
0.159 seconds
[localhost:10000] shark> SELECT title FROM wiki LIMIT 1;
Alpokalja
Time taken (including network latency): 2.23 seconds
14/02/19 01:23:48 INFO CliDriver: Time taken (including network latency):
2.23 seconds
[localhost:10000] shark> CREATE EXTERNAL TABLE es_wiki (id BIGINT, title
STRING, last_modified STRING, xml STRING, text STRING) STORED BY
'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'wikipedia/article');
Time taken (including network latency): 0.061 seconds
14/02/19 01:33:51 INFO CliDriver: Time taken (including network latency):
0.061 seconds
[localhost:10000] shark> INSERT OVERWRITE TABLE es_wiki SELECT w.id, w.title,
w.last_modified, w.xml, w.text FROM wiki w;
[Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution
Error, return code -101 from shark.execution.SparkTask
Time taken (including network latency): 3.575 seconds
14/02/19 01:34:42 INFO CliDriver: Time taken (including network latency):
3.575 seconds
*The stack trace looks like this:*
org.apache.hadoop.hive.ql.metadata.HiveException (org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Out of nodes and retries; caught exception)
org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
scala.collection.Iterator$class.foreach(Iterator.scala:772)
scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:81)
shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:207)
shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
org.apache.spark.scheduler.Task.run(Task.scala:53)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:215)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
As far as I can tell, I'm using Hive 0.9.0, Shark 0.8.1, Elasticsearch 1.0.0,
Hadoop 1.0.4, and Java 1.7.0_51.
Based on a cursory look at the Hadoop and elasticsearch-hadoop sources, it
looks like Hive is just rethrowing an IOException that it gets from Spark,
and elasticsearch-hadoop is simply running into those exceptions.
I suppose my questions are: Does this look like an issue with my
ES/elasticsearch-hadoop config? And has anyone gotten elasticsearch working
with Spark/Shark?
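In case it helps narrow things down: if "Out of nodes and retries" means the
tasks could not reach any Elasticsearch node (that is an assumption on my
part, not something I have confirmed), then a quick check on each worker
would show whether port 9200 is reachable from there. On a worker, localhost
refers to that worker itself, not to the machine where I started
Elasticsearch. Below is a minimal sketch of such a check; the function names
es_reachable and check_nodes are made up for illustration, and it only tests
that a TCP connection can be opened, not that Elasticsearch is healthy.

```python
import socket


def es_reachable(host, port=9200, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def check_nodes(hosts, port=9200, timeout=2.0):
    """Map each host name to whether its Elasticsearch port is reachable."""
    return {host: es_reachable(host, port, timeout) for host in hosts}
```

Running check_nodes(["localhost"]) on the master and on each slave should
show which machines can actually see Elasticsearch on port 9200.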
Any ideas/insights are appreciated.
Thanks,
Max