Don't we already have generalized scalar aggregation? I thought I committed that a while back. It's very useful for inner products, distances, and stats.
Vector accumulation using a BinaryFunction as a map just needs to be made more efficient (taking sparsity and random accessibility into account), but it works. The only remaining piece is something like accumulate(Vector v, BinaryFunction map, BinaryFunction aggregator) - a method on Matrix which aggregates the partial map() combinations of each row with the input Vector and returns a Vector. This generalizes times(Vector). I guess Matrix.assign(Vector v, BinaryFunction map) could be useful for mutating a matrix, but on HDFS it would operate by writing new SequenceFiles.

  -jake

On Feb 18, 2010 5:11 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:

On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> What would this metho...

This method would apply the mapFunction to each corresponding pair of elements from the two vectors and then aggregate the results using the aggregatorFunction. The unit is the unit of the aggregator and would only be needed if the vectors have no entries; we could probably do without it.

This could be a static function or a method on vectorA. Putting the method on vectorA would probably be better because it could drive many common optimizations. Examples of this pattern include sum-squared-difference (agg = plus, map = compose(sqr, minus)) and dot (agg = plus, map = times).

This can be composed with a temporary output vector, or sometimes by mutating one of the operands. That is not as desirable as just accumulating the results on the fly, however.

> The reason why we need a specialized function is to do things in a nicely
> mutating way: Hadoop M...

We definitely need that too.

> The only thing more we need than what we have now is in the assign method -
> currently we ha...

That can work, but it very often requires an extra copy of the vector, as in the distance case that Robin brought up. The contract there says neither operand can be changed, which forces a vector copy in the current API.
A mapReduce-style operation, in addition to a map, would let us avoid that copy in this important case.
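To make the pattern concrete, here is a minimal sketch of the aggregate(agg, map) idea on two vectors, accumulating on the fly with no temporary vector and no mutation of either operand. The method name, the use of plain double[] in place of a real sparse Vector, and the lambda-based functions are all illustrative assumptions, not the committed API:

```java
import java.util.function.DoubleBinaryOperator;

public class AggregateSketch {

    // Combine corresponding elements with `map`, then fold the mapped
    // values together with `agg`. The result accumulates as we scan,
    // so no temporary vector and no operand copy is needed.
    static double aggregate(double[] a, double[] b,
                            DoubleBinaryOperator agg, DoubleBinaryOperator map) {
        double result = map.applyAsDouble(a[0], b[0]);
        for (int i = 1; i < a.length; i++) {
            result = agg.applyAsDouble(result, map.applyAsDouble(a[i], b[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {4, 5, 6};

        // dot product: agg = plus, map = times
        double dot = aggregate(x, y, (s, t) -> s + t, (u, v) -> u * v);

        // sum-squared-difference: agg = plus, map = compose(sqr, minus)
        double ssd = aggregate(x, y, (s, t) -> s + t,
                               (u, v) -> (u - v) * (u - v));

        System.out.println(dot); // 4 + 10 + 18 = 32.0
        System.out.println(ssd); // 9 + 9 + 9 = 27.0
    }
}
```

A real implementation would iterate only the nonzero entries of a sparse vector (taking the unit of the aggregator into account for skipped positions), which is exactly the optimization that putting the method on vectorA enables.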