i would be opposed to that. some argumentation, in order of priority:

(1) as i was mentioning elsewhere, the two opposite poles here are functional
programming (Spark) and algebraic explicitness (Matlab). functional style
makes for messy, hard-to-read code; the non-functional, algebraic style keeps
things clean. quick illustration below.
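
a hedged illustration (mine, just for this thread, not from any patch): the
Gramian A'A written both ways. drmA stands for some DRM; the algebraic line
is the actual DSL, while the functional lines are only a schematic of what a
Spark-level formulation tends to look like (outerProductEntries is made up):

  // algebraic, R-like: one expression the optimizer is free to rewrite
  val gram = drmA.t %*% drmA

  // functional, Spark-style schematic of the same computation: explicit
  // plumbing with a hand-rolled shuffle baked in
  // rddA.flatMap { case (_, row) => outerProductEntries(row) }
  //     .reduceByKey(_ + _)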

(2) i consciously keep things non-functional, with the exception of map. I do
not want functional primitives that imply shuffle tasks. Remember, we are not
in the map-reduce world anymore. Once we introduce a shuffle-reduce primitive,
there's no reason not to introduce shuffle-combine, shuffle-cogroup,
shuffle-cartesian and shuffle-zip, at which point we are back to Spark. My
answer to all those shuffle-dependent operations: take the rdd property and
do whatever specific shuffles you want at the engine level (sketched below).
Also, a custom shuffle task takes optimization away from the optimizer --
unlike a custom map task (mapBlock), which comes with restrictions the
optimizer can live with.
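
a minimal sketch of that escape hatch, assuming the spark bindings, where the
checkpointed DRM is a CheckpointedDrmSpark exposing its underlying
RDD[(K, Vector)] (getting at it may take a cast, plus the usual spark
imports). groupByKey here just stands in for whatever shuffle is actually
needed:

  // drop out of the DSL to the engine level for a one-off shuffle
  val rdd = drmA.checkpoint().asInstanceOf[CheckpointedDrmSpark[Int]].rdd
  val grouped = rdd.groupByKey()  // explicit, user-owned shuffle --
                                  // invisible to the DRM optimizer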

(3) We have been going by two principles: keep it R-like, and add things as
needed. I am still waiting for the case where such an aggregate api would be
needed in a distributed setting. For example, i don't even see the need for
this non-zero count thing. The current api is more than enough to do that;
i would have done it along the lines of transforming A into an indicator
matrix and taking colSums:

// assumes the usual math-scala DSL imports plus collection.JavaConversions._
val colCounts = a.mapBlock() {
  case (keys, block) =>
    // clone first: mutating the incoming block in place is a side effect
    // other pipelines can see if the backing rdd is cached
    val b = block.cloned
    // indicator matrix: every non-zero element becomes 1.0
    for (row <- b; el <- row.vector.iterateNonZero) el.set(1.0)
    keys -> b
}.colSums()

assuming it doesn't trigger any fail-fast asserts, colCounts comes back to
the driver as an in-core vector of per-column non-zero counts, and we are
done here -- in a far more efficient way than with the aggregate api in the
1464 patch, since mapBlock introduces no shuffle at all.

(warning: this kind of code can create side effects in certain pipelines --
if the backing rdd is cached, mutating a block in place leaks into other
consumers, which is why the snippet clones the block first.)

i could probably write up another 3 to 5 weaker arguments amounting to one
strong item, but i've got a cold, am tired, and kind of don't care that much.
On Jun 15, 2014 4:55 PM, "Pat Ferrel" <[email protected]> wrote:

> Seems like a good idea. The current use for aggregation seems pretty
> limited due to being non-distributed. The drm and rdd allow for easy
> construction of closures for processing blocks (like drm.mapBlock) but
> having an api to plug in closures or functions for
> aggregations/accumulators might be a nice piece of syntactic sugar.
>
> I’ve only seen the Scala aggregation stuff used in tests to compare the
> results of small in-core matrix ops to distributed ones. There are separate
> Matrix methods (sometimes using aggregations) and DRM methods, one
> non-distributed, the other distributed. DrmLike currently seems to only
> support row-wise mapBlock but Dmitriy may know better.
>
> On Jun 14, 2014, at 6:45 PM, Ted Dunning <[email protected]> wrote:
>
> In
> math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala:
>
> > @@ -188,8 +188,8 @@ object MatrixOps {
> >      def apply(f: Vector): Double = f.sum
> >    }
> >
> > -  private def vectorCountFunc = new VectorFunction {
> > -    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
> > +  private def vectorCountNonZeroElementsFunc = new VectorFunction {
> > +    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
> The issue I have is with the rowAggregation and columnAggregation API. It
> enforces row by row evaluation. A map-reduce API could evaluate in many
> different orders and could iterate by rows or by columns for either
> aggregation, and wouldn't require a custom VectorFunction for simple
> aggregations.
>
