[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718518#comment-13718518
 ] 

Lukas Vlcek commented on SOLR-5069:
-----------------------------------

[~porqpine]: Well, I see the point. From the user point of view this sounds 
very cool and it will be interesting to see how this feature works out.

Though, this reminds me the situation that happened in Google couple of year 
ago (I heard this from one ex-Googler, not sure if there is any official 
evidence) when they introduced MR platform internally and all summer interns 
started using it. A lot of non-optimal tasks started eating their resources - 
because it is so easy to translate a lot of problems into MR (but it does not 
mean that MR solution to the problem is the optimal one).

As for setting up a separate analytical platform, well... the cost of setting 
it up is one thing, but the benefit of existing tooling and experience is 
another one. Are you going to reimplement Mahout into Solr? - well may be you 
are not aiming at this level of complexity.

You can throttle the thing on many levels, as a result the task will just run 
longer, right? Isn't this in fact a big challenge? If I understand Lucene 
correctly, the costly part is if you need to keep aged IndexReaders around 
because this leads to higher number of opened segments and consumption of 
related resources. And what if the data included into the MR calculation 
changes (reindex/delete) in the meantime? Then you need to be careful in 
presenting the results to the clients because they may be too used to Hadoop MR 
where the original data set is still available.

Anyway, I am sure you are already aware of all this. I am just curious :-)
                
> MapReduce for SolrCloud
> -----------------------
>
>                 Key: SOLR-5069
>                 URL: https://issues.apache.org/jira/browse/SOLR-5069
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Noble Paul
>            Assignee: Noble Paul
>
> Solr currently does not have a way to run long running computational tasks 
> across the cluster. We can piggyback on the mapreduce paradigm so that users 
> have smooth learning curve.
>  * The mapreduce component will be written as a RequestHandler in Solr
>  * Works only in SolrCloud mode. (No support for standalone mode) 
>  * Users can write MapReduce programs in Javascript or Java. First cut would 
> be JS ( ? )
> h1. sample word count program
> h2.how to invoke?
> http://host:port/solr/collection-x/mapreduce?map=<map-script>&reduce=<reduce-script>&sink=collectionX
> h3. params 
> * map :  A javascript implementation of the map program
> * reduce : a Javascript implementation of the reduce program
> * sink : The collection to which the output is written. If this is not passed 
> , the request will wait till completion and respond with the output of the 
> reduce program and will be emitted as a standard solr response. . If no sink 
> is passed the request will be redirected to the "reduce node" where it will 
> wait till the process is complete. If the sink param is passed ,the rsponse 
> will contain an id of the run which can be used to query the status in 
> another command.
> * reduceNode : Node name where the reduce is run . If not passed an arbitrary 
> node is chosen
> The node which received the command would first identify one replica from 
> each slice where the map program is executed . It will also identify one 
> another node from the same collection where the reduce program is run. Each 
> run is given an id and the details of the nodes participating in the run will 
> be written to ZK (as an ephemeral node). 
> h4. map script 
> {code:JavaScript}
> var res = $.streamQuery(*:*);//this is not run across the cluster. //Only on 
> this index
> while(res.hasMore()){
>   var doc = res.next();
>   var txt = doc.get(“txt”);//the field on which word count is performed
>   var words = txt.split(" ");
>    for(i = 0; i < words.length; i++){
>       $.map(words[i],{‘count’:1});// this will send the map over to //the 
> reduce host
>     }
> }
> {code}
> Essentially two threads are created in the 'map' hosts . One for running the 
> program and the other for co-ordinating with the 'reduce' host . The maps 
> emitted are streamed live over an http connection to the reduce program
> h4. reduce script
> This script is run in one node . This node accepts http connections from map 
> nodes and the 'maps' that are sent are collected in a queue which will be 
> polled and fed into the reduce program. This also keeps the 'reduced' data in 
> memory till the whole run is complete. It expects a "done" message from all 
> 'map' nodes before it declares the tasks are complete. After  reduce program 
> is executed for all the input it proceeds to write out the result to the 
> 'sink' collection or it is written straight out to the response.
> {code:JavaScript}
> var pair = $.nextMap();
> var reduced = $.getCtx().getReducedMap();// a hashmap
> var count = reduced.get(pair.key());
> if(count === null) {
>   count = {“count”:0};
>   reduced.put(pair.key(), count);
> }
> count.count += pair.val().count ;
> {code}
> h4.example output
> {code:JavaScript}
> {
> “result”:[
> “wordx”:{ 
>          “count”:15876765
>          },
> “wordy” : {
>            “count”:24657654
>           }
>  
>   ]
> }
> {code}
> TBD
> * The format in which the output is written to the target collection, I 
> assume the reducedMap will have values mapping to the schema of the collection
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to