Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

Noble Paul Wed, 20 May 2015 08:08:19 -0700

The problem with streaming is data locality. Data needs to be transferred
across network to do the processing
On May 20, 2015 8:15 PM, "Yonik Seeley (JIRA)" <[email protected]> wrote:


>
>     [
> https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552414#comment-14552414
> ]
>
> Yonik Seeley commented on SOLR-5069:
> ------------------------------------
>
> Looks like SOLR-6526 (Solr Streaming API) is pretty much map-reduce?
> And then on top is SOLR-7377 (Solr Streaming Expressions) and SOLR-7560
> (Parallel SQL)
>
> > MapReduce for SolrCloud
> > -----------------------
> >
> >                 Key: SOLR-5069
> >                 URL: https://issues.apache.org/jira/browse/SOLR-5069
> >             Project: Solr
> >          Issue Type: New Feature
> >          Components: SolrCloud
> >            Reporter: Noble Paul
> >            Assignee: Noble Paul
> >
> > Solr currently does not have a way to run long running computational
> tasks across the cluster. We can piggyback on the mapreduce paradigm so
> that users have smooth learning curve.
> >  * The mapreduce component will be written as a RequestHandler in Solr
> >  * Works only in SolrCloud mode. (No support for standalone mode)
> >  * Users can write MapReduce programs in Javascript or Java. First cut
> would be JS ( ? )
> > h1. sample word count program
> > h2.how to invoke?
> > http://host:port
> /solr/collection-x/mapreduce?map=<map-script>&reduce=<reduce-script>&sink=collectionX
> > h3. params
> > * map :  A javascript implementation of the map program
> > * reduce : a Javascript implementation of the reduce program
> > * sink : The collection to which the output is written. If this is not
> passed , the request will wait till completion and respond with the output
> of the reduce program and will be emitted as a standard solr response. . If
> no sink is passed the request will be redirected to the "reduce node" where
> it will wait till the process is complete. If the sink param is passed ,the
> rsponse will contain an id of the run which can be used to query the status
> in another command.
> > * reduceNode : Node name where the reduce is run . If not passed an
> arbitrary node is chosen
> > The node which received the command would first identify one replica
> from each slice where the map program is executed . It will also identify
> one another node from the same collection where the reduce program is run.
> Each run is given an id and the details of the nodes participating in the
> run will be written to ZK (as an ephemeral node).
> > h4. map script
> > {code:JavaScript}
> > var res = $.streamQuery($.param(“q"));//this is not run across the
> cluster. //Only on this index
> > while(res.hasMore()){
> >   var doc = res.next();
> >   map(doc);
> > }
> > function  map(doc) {
> >   var txt = doc.get(“txt”);//the field on which word count is performed
> >   var words = txt.split(" ");
> >    for(i = 0; i < words.length; i++){
> >       $.emit(words[i],{‘count’:1});// this will send the map over to
> //the reduce host
> >     }
> > }
> > {code}
> > Essentially two threads are created in the 'map' hosts . One for running
> the program and the other for co-ordinating with the 'reduce' host . The
> maps emitted are streamed live over an http connection to the reduce program
> > h4. reduce script
> > This script is run in one node . This node accepts http connections from
> map nodes and the 'maps' that are sent are collected in a queue which will
> be polled and fed into the reduce program. This also keeps the 'reduced'
> data in memory till the whole run is complete. It expects a "done" message
> from all 'map' nodes before it declares the tasks are complete. After
> reduce program is executed for all the input it proceeds to write out the
> result to the 'sink' collection or it is written straight out to the
> response.
> > {code:JavaScript}
> > var pair = $.nextMap();
> > var reduced = $.getCtx().getReducedMap();// a hashmap
> > var count = reduced.get(pair.key());
> > if(count === null) {
> >   count = {“count”:0};
> >   reduced.put(pair.key(), count);
> > }
> > count.count += pair.val().count ;
> > {code}
> > h4.example output
> > {code:JavaScript}
> > {
> > “result”:[
> > “wordx”:{
> >          “count”:15876765
> >          },
> > “wordy” : {
> >            “count”:24657654
> >           }
> >
> >   ]
> > }
> > {code}
> > TBD
> > * The format in which the output is written to the target collection, I
> assume the reducedMap will have values mapping to the schema of the
> collection
> >
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

Reply via email to