Re: Newbie question about Spark and Elasticsearch

chris Thu, 18 Dec 2014 10:28:07 -0800

Hi,

You recommend the native integration instead of MR and I see on the 
official documentation that MR is recommended to read/write data to ES 
using spark. Spark support Doc 
<http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/2.1.Beta/spark.html>


what would be the basic piece of code to read data from ES without using MR 
?

I'm currently struggling with EsInputFormat[org.apache.hadoop.io.Text, 
MapWritable] structure.

my code is :

val sc = new SparkContext(...) 

  val configuration = new Configuration()
  configuration.set("es.nodes", "xxxxxx")
  configuration.set("es.port", "9200")
  configuration.set("es.resource", resource) // my index/type
  configuration.set("es.query", query) //basicaly a match_all

  val esRDD = sc.newAPIHadoopRDD(configuration, 
classOf[EsInputFormat[org.apache.hadoop.io.Text, 
MapWritable]],classOf[org.apache.hadoop.io.Text], classOf[MapWritable])

assume my data is mapped as follow :

{
    "oceannetworks": {
        "mappings": {
            "transcript": {
                "properties": {
                    "cruiseID": {
                        "type": "string"
                    },
                    "diveID": {
                        "type": "string"
                    },
                    "filename_root": {
                        "type": "string"
                    },
                    "id": {
                        "type": "string"
                    },
                    "result": {
                        "type": "nested",
                        "properties": {
                            "begin_time": {
                                "type": "double"
                            },
                            "confidence": {
                                "type": "double"
                            },
                            "end_time": {
                                "type": "double"
                            },
                            "location": {
                                "type": "geo_point"
                            },
                            "word": {
                                "type": "string"
                            }
                        }
                    },
                    "status": {
                        "type": "string"
                    },
                    "uuid": {
                        "type": "string"
                    },
                    "version": {
                        "type": "string"
                    }
                }
            }
        }
    }
}


I'm able to retrieve 1st level information like diveID , cruiseID ... but 
it's not clear how to get the 2nd lvl collection "result". It seams I get a 
WritableArrayWritable but I'm not sure how to handle it.

I get 1st lvl data with these king of code :

val uuids = esRDD.map(_._2.get(new 
org.apache.hadoop.io.Text("uuid")).toString).take(10)

I could use a little bit of help :)

thanks.

chris

Le lundi 8 décembre 2014 10:19:12 UTC-5, Costin Leau a écrit :
>
> Hi, 
>
> First off I recommend using the native integration (aka the Java/Scala 
> APIs) instead of MapReduce. The latter works but 
> the former is better performing and more flexible. 
>
> ES works in a similar fashion to the HDFS store - the data doesn't go 
> through the master rather, each task has its own 
> partition on works on its own set of data. Behind the scenes we map each 
> worker to an index shard (if there aren't 
> enough workers, then some will work across multiple shards). 
>
>
> On 12/8/14 4:59 PM, Mohamed Lrhazi wrote: 
> > am trying to understand how spark and ES work... could someone please 
> help me answer this question.. 
> > 
> > val conf = new Configuration() 
> > conf.set("es.resource", "radio/artists") 
> > conf.set("es.query", "?q=me*") 
> > val esRDD = sc.newHadoopRDD(conf, classOf[EsInputFormat[Text, 
> MapWritable]], 
> >                                    classOf[Text], classOf[MapWritable])) 
> > val docCount = esRDD.count(); 
> > 
> > 
> > When and where is data being transferred from ES? is it all collected on 
> the Spark master node, then partitioned and 
> > sent to the worker nodes? or is each worker node talking to ES to 
> somehow get a partition of the data? 
> > 
> > How does this effectively work? 
> > 
> > Thanks a lot, 
> > Mohamed. 
> > 
> > -- 
> > You received this message because you are subscribed to the Google 
> Groups "elasticsearch" group. 
> > To unsubscribe from this group and stop receiving emails from it, send 
> an email to 
> > [email protected] <javascript:> <mailto:
> [email protected] <javascript:>>. 
> > To view this discussion on the web visit 
> > 
> https://groups.google.com/d/msgid/elasticsearch/CAEU_gmf9Nt0xn_0NbzDn_moRWUT96uWYf4cicJdZik3r0Zz8XA%40mail.gmail.com
>  
> > <
> https://groups.google.com/d/msgid/elasticsearch/CAEU_gmf9Nt0xn_0NbzDn_moRWUT96uWYf4cicJdZik3r0Zz8XA%40mail.gmail.com?utm_medium=email&utm_source=footer>.
>  
>
> > For more options, visit https://groups.google.com/d/optout. 
>
> -- 
> Costin 
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/959a9e83-5cae-4428-a45d-3ae5266af275%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Newbie question about Spark and Elasticsearch

Reply via email to