Hi:
I have a very simple application that queries an ES instance and returns
the count of documents found by the query. I am using the Spark interface
because I intend to run ML algorithms on the result set. With that said,
here are the problems I face:
1. If I set up the Configuration (to use with newAPIHadoopRDD) or JobConf
(to use with hadoopRDD) against a remote ES instance like so (this is
using the newAPIHadoopRDD interface):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.hadoop.mr.EsInputFormat

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("TestESSpark")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
val sc = new SparkContext(sparkConf)

val conf = new Configuration // change to new JobConf for the old API
conf.set("es.nodes", "remote.server:port")
conf.set("es.resource", "index/type")
conf.set("es.query", "{\"query\":{\"match_all\":{}}}")

// change to sc.hadoopRDD for the old API
val esRDD = sc.newAPIHadoopRDD(conf, classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])
val docCount = esRDD.count
println(docCount)
The application just hangs at the println (basically while executing the
search, or so I think).
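For what it's worth, my understanding is that the same count can also be
expressed through the native Spark support that ships with
elasticsearch-hadoop 2.1.x (the org.elasticsearch.spark implicits). A minimal
sketch of that variant, with remote.server:port again being a placeholder:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("TestESSparkNative")
sparkConf.set("es.nodes", "remote.server:port") // placeholder host:port, as above
val sc = new SparkContext(sparkConf)

// esRDD(resource, query) yields an RDD of (id, document) pairs
val esRDD = sc.esRDD("index/type", "{\"query\":{\"match_all\":{}}}")
println(esRDD.count)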
2. If I use localhost instead of "remote.server:port" for es.nodes, the
application throws an exception:
Exception in thread "main" org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:123)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:303)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:287)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:291)
    at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:118)
    at org.elasticsearch.hadoop.rest.RestClient.discoverNodes(RestClient.java:100)
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverNodesIfNeeded(InitializationUtils.java:57)
    at org.elasticsearch.hadoop.rest.RestService.findPartitions(RestService.java:220)
    at org.elasticsearch.hadoop.mr.EsInputFormat.getSplits(EsInputFormat.java:406)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
    at org.apache.spark.rdd.RDD.count(RDD.scala:904)
    at trgr.rd.newsplus.pairgen.ElasticSparkTest1$.main(ElasticSparkTest1.scala:59)
    at trgr.rd.newsplus.pairgen.ElasticSparkTest1.main(ElasticSparkTest1.scala)
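To rule out the obvious on the localhost side, a quick sanity check such as
the following (a minimal sketch; it assumes the local instance is on the
default HTTP port 9200) should show whether anything is answering at all:

import scala.io.Source

// Hit the ES root endpoint directly; if this fails, the problem is the
// local instance or the network rather than elasticsearch-hadoop.
val response = Source.fromURL("http://localhost:9200").mkString
println(response) // expect a small JSON blob with the cluster name and version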
I am using version 2.1.0.Beta2 of the elasticsearch-hadoop library, running
against a local ES instance at version 1.3.2 and a remote instance at ES
version 1.0.0.
Any insight as to what I might be missing or doing wrong?
Thanks
Ramdev