should be, but they use groupByKey(). The output of that operator implies a full group with no intermediate product (i.e. a (K, Array[T]) type for groups). I don't think it implies partial groups: something like a per-group count would not work in that case, as there's no way to reduce combiner outputs under these semantics. *I think* if they wanted to use combiners, they would have to use the combine() API with a clear specification of what the output of a combiner might be.
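For illustration, here's a rough sketch of the distinction I mean (Spark Scala; the helper names are purely illustrative, not the actual mllib code). groupByKey() hands you the whole group, whereas combineByKey() forces you to spell out what a partial (combiner) result is and how partials merge:

```scala
import org.apache.spark.rdd.RDD

// groupByKey(): each key's values are materialized as one full group,
// so there is no room for a partial, combiner-side aggregate.
def fullGroups(points: RDD[(Int, Double)]): RDD[(Int, Iterable[Double])] =
  points.groupByKey()

// combineByKey(): the caller specifies what a combiner output looks like
// and how two such outputs merge, so map-side pre-aggregation is possible.
def groupedCounts(points: RDD[(Int, Double)]): RDD[(Int, Long)] =
  points.combineByKey(
    (_: Double) => 1L,                 // createCombiner: first value seen for a key
    (c: Long, _: Double) => c + 1L,    // mergeValue: fold another value into a partial count
    (c1: Long, c2: Long) => c1 + c2    // mergeCombiners: merge map-side partials
  )
```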
But your point is well-taken: one wouldn't need to use 1000 reducers to meaningfully overcome the skew. Even if there is a 1:1000 absolute-count skew after combiners, that still doesn't really qualify as much of a reducer skew, and A.t would eliminate even that unevenness. Which is not a big deal. The problem is, I want to use linear algebra to handle that, not combine(). Yeah, I suppose there are cases that fit linalg better and ones that fit it less. And then there are cases for which we haven't abstracted all the necessary toolset yet.

On Tue, Apr 8, 2014 at 3:24 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > I suspect mllib code would suffer from non-deterministic parallelism
> > behavior in its Lloyd iteration (as well as memory overflow risk) in
> > certain corner-case situations, such as when there are a lot of data
> > points but very few clusters sought. Spark (for the right reasons)
> > doesn't believe in sort-n-spill stuff, which means centroid
> > recomputation may suffer from decreased or asymmetric parallelism after
> > the groupByKey call, especially if cluster attribution counts end up
> > heavily skewed for whatever reason.
> >
> > E.g. if you have 1 bln points and two centroids, it's my understanding
> > that post-attribution centroid reduction will create only two tasks,
> > each processing no less than 500 mln attributed points, even if cluster
> > capacity quite adequately offers 500 tasks for this job.
>
> This should be susceptible to combiner-like pre-aggregation. That should
> give essentially perfect parallelism.
>
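P.S. For concreteness, a rough sketch of what the combiner-style centroid pre-aggregation being discussed could look like (Spark Scala; names and types are illustrative assumptions, not the actual mllib implementation). reduceByKey does map-side combining, so even with 1 bln points and two centroids no single task has to see all points attributed to one cluster before reduction:

```scala
import org.apache.spark.rdd.RDD

// Combiner-style centroid update: pre-aggregate a (vector sum, count) pair per
// cluster on the map side, then take the mean of each pre-aggregated sum.
def recomputeCentroids(assigned: RDD[(Int, Array[Double])]): Map[Int, Array[Double]] = {

  // Element-wise vector addition (assumes all points have the same dimension).
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  assigned
    .mapValues(v => (v, 1L))                                        // (clusterId, (point, 1))
    .reduceByKey { case ((s1, n1), (s2, n2)) => (add(s1, s2), n1 + n2) } // map-side combining
    .mapValues { case (sum, n) => sum.map(_ / n) }                  // centroid = sum / count
    .collectAsMap()
    .toMap
}
```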