Hi:
I have a very simple application that queries an ES instance and returns
the count of documents found by the query. I am using the Spark interface
because I intend to run ML algorithms on the result set. With that said,
here are the problems I face:
1. If I set up the Configuration (to use with newAPIHadoopRDD) or JobConf
(to use with hadoopRDD) against a remote ES instance like so (this is
using the newAPIHadoopRDD interface):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.hadoop.mr.EsInputFormat

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("TestESSpark")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
val sc = new SparkContext(sparkConf)

val conf = new Configuration // change to new JobConf for the old API
conf.set("es.nodes", "remote.server:port")
conf.set("es.resource", "index/type")
conf.set("es.query", "{\"query\":{\"match_all\":{}}}")

// change to sc.hadoopRDD for the old API
val esRDD = sc.newAPIHadoopRDD(conf, classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])
val docCount = esRDD.count
println(docCount)
The application just hangs at the println (basically while executing the
search, or so I think).
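For what it's worth, my understanding is that the same count can also be
expressed through the native Spark support that ships with
elasticsearch-hadoop 2.1.x (the org.elasticsearch.spark implicits). A minimal
sketch of that variant, with remote.server:port again being a placeholder:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("TestESSparkNative")
sparkConf.set("es.nodes", "remote.server:port") // placeholder host:port, as above
val sc = new SparkContext(sparkConf)

// esRDD(resource, query) yields an RDD of (id, document) pairs
val esRDD = sc.esRDD("index/type", "{\"query\":{\"match_all\":{}}}")
println(esRDD.count)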
2. If I use localhost instead of "remote.server:port" for es.nodes, the
application throws an exception:
Exception in thread "main" org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:123)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:303)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:287)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:291)
    at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:118)
    at org.elasticsearch.hadoop.rest.RestClient.discoverNodes(RestClient.java:100)
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverNodesIfNeeded(InitializationUtils.java:57)
    at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:220)
    at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:406)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
    at org.apache.spark.rdd.RDD.count(RDD.scala:904)
    at trgr.rd.newsplus.pairgen.ElasticSparkTest1$.main(ElasticSparkTest1.scala:59)
    at trgr.rd.newsplus.pairgen.ElasticSparkTest1.main(ElasticSparkTest1.scala)
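To rule out the obvious on the localhost side, a quick sanity check such as
the following (a minimal sketch; it assumes the local instance is on the
default HTTP port 9200) should show whether anything is answering at all:

import scala.io.Source

// Hit the ES root endpoint directly; if this fails, the problem is the
// local instance or the network rather than elasticsearch-hadoop.
val response = Source.fromURL("http://localhost:9200").mkString
println(response) // expect a small JSON blob with the cluster name and version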
I am using version 2.1.0.Beta2 of the elasticsearch-hadoop library, running
against a local ES instance at version 1.3.2 and a remote instance at ES
version 1.0.0.
Any insight as to what I might be missing or doing wrong?
Thanks
Ramdev