The problem with streaming is data locality. Data needs to be transferred across network to do the processing On May 20, 2015 8:15 PM, "Yonik Seeley (JIRA)" <[email protected]> wrote:
> > [ > https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552414#comment-14552414 > ] > > Yonik Seeley commented on SOLR-5069: > ------------------------------------ > > Looks like SOLR-6526 (Solr Streaming API) is pretty much map-reduce? > And then on top is SOLR-7377 (Solr Streaming Expressions) and SOLR-7560 > (Parallel SQL) > > > MapReduce for SolrCloud > > ----------------------- > > > > Key: SOLR-5069 > > URL: https://issues.apache.org/jira/browse/SOLR-5069 > > Project: Solr > > Issue Type: New Feature > > Components: SolrCloud > > Reporter: Noble Paul > > Assignee: Noble Paul > > > > Solr currently does not have a way to run long running computational > tasks across the cluster. We can piggyback on the mapreduce paradigm so > that users have smooth learning curve. > > * The mapreduce component will be written as a RequestHandler in Solr > > * Works only in SolrCloud mode. (No support for standalone mode) > > * Users can write MapReduce programs in Javascript or Java. First cut > would be JS ( ? ) > > h1. sample word count program > > h2.how to invoke? > > http://host:port > /solr/collection-x/mapreduce?map=<map-script>&reduce=<reduce-script>&sink=collectionX > > h3. params > > * map : A javascript implementation of the map program > > * reduce : a Javascript implementation of the reduce program > > * sink : The collection to which the output is written. If this is not > passed , the request will wait till completion and respond with the output > of the reduce program and will be emitted as a standard solr response. . If > no sink is passed the request will be redirected to the "reduce node" where > it will wait till the process is complete. If the sink param is passed ,the > rsponse will contain an id of the run which can be used to query the status > in another command. > > * reduceNode : Node name where the reduce is run . If not passed an > arbitrary node is chosen > > The node which received the command would first identify one replica > from each slice where the map program is executed . It will also identify > one another node from the same collection where the reduce program is run. > Each run is given an id and the details of the nodes participating in the > run will be written to ZK (as an ephemeral node). > > h4. map script > > {code:JavaScript} > > var res = $.streamQuery($.param(“q"));//this is not run across the > cluster. //Only on this index > > while(res.hasMore()){ > > var doc = res.next(); > > map(doc); > > } > > function map(doc) { > > var txt = doc.get(“txt”);//the field on which word count is performed > > var words = txt.split(" "); > > for(i = 0; i < words.length; i++){ > > $.emit(words[i],{‘count’:1});// this will send the map over to > //the reduce host > > } > > } > > {code} > > Essentially two threads are created in the 'map' hosts . One for running > the program and the other for co-ordinating with the 'reduce' host . The > maps emitted are streamed live over an http connection to the reduce program > > h4. reduce script > > This script is run in one node . This node accepts http connections from > map nodes and the 'maps' that are sent are collected in a queue which will > be polled and fed into the reduce program. This also keeps the 'reduced' > data in memory till the whole run is complete. It expects a "done" message > from all 'map' nodes before it declares the tasks are complete. After > reduce program is executed for all the input it proceeds to write out the > result to the 'sink' collection or it is written straight out to the > response. > > {code:JavaScript} > > var pair = $.nextMap(); > > var reduced = $.getCtx().getReducedMap();// a hashmap > > var count = reduced.get(pair.key()); > > if(count === null) { > > count = {“count”:0}; > > reduced.put(pair.key(), count); > > } > > count.count += pair.val().count ; > > {code} > > h4.example output > > {code:JavaScript} > > { > > “result”:[ > > “wordx”:{ > > “count”:15876765 > > }, > > “wordy” : { > > “count”:24657654 > > } > > > > ] > > } > > {code} > > TBD > > * The format in which the output is written to the target collection, I > assume the reducedMap will have values mapping to the schema of the > collection > > > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
