Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

Joel Bernstein Wed, 20 May 2015 10:39:48 -0700

The Streaming Expressions language is a DSL to process docs and emit
processed data. The parallel SQL engine will also fit into this category.
Both of these languages compile to the Streaming API which is basically a
real-time map-reduce framework that runs on SolrCloud worker nodes.

The Streaming API has excellent data locality for a Map-Reduce engine
because it performs the map stage and sorting and partitioning of result
sets inside of Solr before tuples are streamed.  Sorted and partitioned
tuples are then sent directly to the correct worker nodes to be reduced.
The Streaming API doesn't follow a strict map/reduce model though. Streams
are merged and manipulated by wrapping decorator streams around each other.
So the Streaming API is much more flexible than old-style map/reduce.
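
As a rough illustration of the flow described above (sort and partition at
the shard, then merge and reduce on a worker by wrapping streams around each
other), here is a minimal Python sketch. Every name in it (shard_stream,
partition, merge_streams, reduce_sum, run_worker) is hypothetical and chosen
for this example; none of them are Solr or SolrJ classes, and the CRC32 hash
is only an assumed partitioning scheme.

```python
from heapq import merge
from itertools import groupby
from zlib import crc32

# Hypothetical stand-ins for illustration only; these are not SolrJ classes.

def shard_stream(docs, sort_key):
    """'Map' side: a shard sorts its own tuples before streaming them out."""
    return iter(sorted(docs, key=lambda t: t[sort_key]))

def partition(stream, key, worker_id, num_workers):
    """Shard side: emit only the tuples whose key hashes to this worker."""
    for t in stream:
        if crc32(str(t[key]).encode()) % num_workers == worker_id:
            yield t

def merge_streams(streams, key):
    """Decorator: merge several already-sorted streams into one sorted stream."""
    return merge(*streams, key=lambda t: t[key])

def reduce_sum(stream, key, field):
    """Decorator: collapse adjacent tuples sharing a key, summing one field."""
    for k, group in groupby(stream, key=lambda t: t[key]):
        yield {key: k, field: sum(t[field] for t in group)}

def run_worker(shards, key, field, worker_id, num_workers):
    """One worker: wrap the decorators around the partitioned shard streams."""
    parts = [
        partition(shard_stream(docs, key), key, worker_id, num_workers)
        for docs in shards
    ]
    return list(reduce_sum(merge_streams(parts, key), key, field))
```

Because each shard sorts and partitions before anything is sent, a worker
only receives the tuples it is responsible for, and the reduce step can
process them in a single pass over the merged stream.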

But the Streaming API is not designed for parallel iterative algorithms
like gradient descent. For the parallel iterative case it's much faster to
leave the data in place and run an embedded algorithm inside of Solr.

At this point data must cross the network if you have multiple worker nodes.
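
To make the iterative case mentioned above concrete, here is a small Python
sketch of gradient descent where the rows never leave the node that holds
them. It is only an assumption about what "leaving the data in place" could
look like, not a description of an existing Solr feature; the function names,
the one-parameter model y ~ w*x, and the learning rate are all made up for
the example.

```python
def local_gradient(rows, w):
    """Sum of squared-error gradients for y ~ w*x over one node's local rows."""
    return sum(2 * (w * x - y) * x for x, y in rows)

def fit_in_place(nodes, steps=200, lr=0.01):
    """Each element of `nodes` is the list of (x, y) rows held by one node."""
    w = 0.0
    n = sum(len(rows) for rows in nodes)
    for _ in range(steps):
        # Only one number per node is combined per iteration; the rows
        # themselves are never moved.
        grad = sum(local_gradient(rows, w) for rows in nodes) / n
        w -= lr * grad
    return w
```

Streaming the rows to workers would instead move the full data set across
the network on every one of the iterations, which is why the in-place
approach is the better fit for this kind of algorithm.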

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, May 20, 2015 at 5:57 PM, Noble Paul <[email protected]> wrote:

>
>
> On Wed, May 20, 2015 at 10:17 PM, Yonik Seeley <[email protected]> wrote:
>
>> On Wed, May 20, 2015 at 12:04 PM, Noble Paul <[email protected]>
>> wrote:
>> >
>> > On Wed, May 20, 2015 at 8:41 PM, Yonik Seeley <[email protected]>
>> wrote:
>> >>
>> >> On Wed, May 20, 2015 at 11:06 AM, Noble Paul <[email protected]>
>> wrote:
>> >> > The problem with streaming is data locality. Data needs to be
>> >> > transferred
>> >> > across network to do the processing
>> >>
>> >> Nothing saying that you can't process data before it's streamed out,
>> >> right?
>> >
>> > yes, if our query language is expressive enough . Sometimes you need a
>> > little programming language to achieve that
>>
>> Right - and different languages can go on top of the base streaming
>> stuff... either before or after the streaming step.
>> There's no reason we can't stream derived data - it doesn't need to be
>> just documents.
>>
> Yes, but is there a way to do it now? If we can have a DSL which can
> process docs and emit the processed data, then the streaming API may be
> able to do without data locality.
>
> I guess the streaming API runs as a standalone program. Can it not be
> running somewhere in the Solr cluster itself?
>
>>
>> -Yonik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>
> --
> -----------------------------------------------------
> Noble Paul
>
