Should be, but they use groupByKey(). The output of that operator implies a full group with no intermediate product (i.e., a [K, Array[T]] type for groups). I don't think it admits partial groups: something like a group count would not work in that case, since there is no way to reduce combiner outputs under those semantics. *I think* if they wanted to use combiners, they would have to use the combine() API with a clear specification of what the combiner output is.
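
To make the distinction concrete, here is a minimal sketch (plain Python, no Spark; the function name `combine_by_key` and its argument names merely mirror Spark's combineByKey contract) of why a group count needs reducible combiner outputs, which a [K, Array[T]] full-group contract does not provide:

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Mimic combineByKey-style semantics on plain lists of (key, value)
    pairs: fold each partition into partial combiners, then merge them."""
    # map side: build a partial combiner per key within each partition
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        partials.append(acc)
    # reduce side: merge partial combiners across partitions -- this is
    # exactly the step a full-group [K, Array[T]] contract doesn't allow
    out = {}
    for acc in partials:
        for k, c in acc.items():
            out[k] = merge_combiners(out[k], c) if k in out else c
    return out

# group count: the combiner output (a partial count) is itself reducible
parts = [[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]]
counts = combine_by_key(parts,
                        lambda _: 1,          # create a count of 1
                        lambda c, _: c + 1,   # fold a value into a count
                        lambda x, y: x + y)   # merge two partial counts
# counts == {"a": 3, "b": 1}
```

The point is that the caller must state what the combiner output type is (here, a partial count) and how two of them merge; groupByKey's contract fixes the output to the full array of values and leaves no room for that.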

But your point is well taken: one wouldn't need 1000 reducers to meaningfully overcome the skew. Even a 1:1000 absolute count skew after combiners doesn't really qualify as much of a reducer skew, and A.t would eliminate even that unevenness, which is not a big deal. The problem is, I want to use linear algebra to handle that, not combine(). That said, I suppose there are cases that fit linalg better and ones that fit it less, and then there are cases for which we haven't abstracted all the necessary toolset yet.
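For illustration, here is a sketch of the combiner-style centroid recomputation being discussed (plain Python, hypothetical helper names, not Spark or Mahout code): each partition pre-aggregates one (vector sum, count) record per cluster, so the reduce side merges only k small records per partition regardless of how skewed cluster attribution is.

```python
def partial_stats(points):
    """Map side: fold a partition's (cluster, point) pairs into one
    (vector sum, count) record per cluster."""
    acc = {}
    for c, x in points:
        s, n = acc.get(c, ([0.0] * len(x), 0))
        acc[c] = ([a + b for a, b in zip(s, x)], n + 1)
    return acc

def merge_centroids(partials):
    """Reduce side: merge the tiny per-partition records, then divide.
    Work here is O(k * num_partitions), independent of attribution skew."""
    merged = {}
    for acc in partials:
        for c, (s, n) in acc.items():
            ms, mn = merged.get(c, ([0.0] * len(s), 0))
            merged[c] = ([a + b for a, b in zip(ms, s)], mn + n)
    return {c: [v / n for v in s] for c, (s, n) in merged.items()}

# even with heavy skew toward cluster 0, each partition emits
# at most one small record per cluster
p1 = partial_stats([(0, [1.0, 0.0]), (0, [3.0, 0.0])])
p2 = partial_stats([(0, [2.0, 4.0]), (1, [5.0, 5.0])])
centroids = merge_centroids([p1, p2])
```

This is the pre-aggregation shape Ted refers to below the quote: post-combiner reducer input shrinks to a few records per cluster, which is why residual 1:1000 value skew stops mattering.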



On Tue, Apr 8, 2014 at 3:24 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > I suspect mllib code would suffer from non-deterministic parallelism
> > behavior in its Lloyd iteration (as well as a memory overflow risk) in
> > certain corner-case situations, such as when there are a lot of data points
> > but very few clusters sought. Spark (for the right reasons) doesn't believe
> in
> > sort-n-spill stuff, which means centroid recomputation may suffer from
> > decreased or asymmetric parallelism after the groupByKey call, especially if
> > cluster attribution counts end up heavily skewed for whatever reason.
> >
> > E.g., if you have 1 bln points and two centroids, it's my understanding
> that
> > post-attribution centroid reduction will create only two tasks, each
> > processing no fewer than 500 mln attributed points, even if cluster
> capacity
> > quite adequately offers 500 tasks for this job.
> >
>
> This should be susceptible to combiner-like pre-aggregation.  That should
> give essentially perfect parallelism.
>
