If you're going to be shuffling data to multiple worker nodes, then data
will be crossing the network. Shuffling provides the foundation for certain
parallel computing tasks, such as performing large-scale parallel
relational algebra.
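As a rough illustration of the idea (this is not Solr code, just a sketch): a parallel join hash-partitions both inputs on the join key so that matching rows land on the same worker, which is exactly the point where data crosses the network.

```python
# Illustrative sketch: hash-partition rows by join key so matching rows
# from both relations land on the same "worker" partition.
def shuffle(rows, key, n_workers):
    """Assign each row to a partition by hashing its join key."""
    partitions = [[] for _ in range(n_workers)]
    for row in rows:
        partitions[hash(row[key]) % n_workers].append(row)
    return partitions

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "z"}, {"id": 3, "b": "w"}]

# Rows with the same id hash to the same partition on both sides,
# so each worker can join its own pair of partitions locally.
lp = shuffle(left, "id", 4)
rp = shuffle(right, "id", 4)
joined = [dict(l, **r)
          for l_part, r_part in zip(lp, rp)
          for l in l_part for r in r_part
          if l["id"] == r["id"]]
```

The shuffle step is where the network cost lives; once partitioning is done, the join itself is embarrassingly parallel.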

For machine learning algorithms, we'll likely need a parallel, iterative
design that leaves the data in place.
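A minimal sketch of that data-local pattern (again, not Solr code): each worker keeps its shard and, per iteration, ships back only a small summary; here a toy gradient step toward the global mean, where only scalars ever cross the "network".

```python
# Illustrative sketch: iterative, data-local computation. The shards
# never move; each iteration exchanges only small per-shard summaries.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # data stays on each worker

def local_gradient(shard, theta):
    """Per-worker contribution: gradient of squared error plus shard size."""
    return sum(theta - x for x in shard), len(shard)

theta = 0.0
for _ in range(100):                   # iterate; the data never moves
    parts = [local_gradient(s, theta) for s in shards]
    grad = sum(g for g, _ in parts) / sum(n for _, n in parts)
    theta -= 0.5 * grad                # only scalars crossed the network
```

With this update rule theta converges to the global mean (3.5), while the per-iteration network traffic is a couple of numbers per worker rather than the data itself.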

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, May 20, 2015 at 4:11 PM, Yonik Seeley <[email protected]> wrote:

> On Wed, May 20, 2015 at 11:06 AM, Noble Paul <[email protected]> wrote:
> > The problem with streaming is data locality. Data needs to be transferred
> > across network to do the processing
>
> Nothing saying that you can't process data before it's streamed out, right?
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
