Don't we already have generalized scalar aggregation? I thought I committed that a while back. It's very useful for inner products, distances, and stats.
Vector accumulation using a BinaryFunction as a map just needs to be made more efficient (taking sparsity and random accessibility into account), but it works. The only remaining piece is something like accumulate(Vector v, BinaryFunction map, BinaryFunction aggregator) - a method on Matrix which aggregates the partial map() combinations of each row with the input Vector and returns a Vector. This generalizes times(Vector). I guess Matrix.assign(Vector v, BinaryFunction map) could be useful for mutating a matrix, but on HDFS it would operate by writing new SequenceFiles.

  -jake

On Feb 18, 2010 5:11 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:

On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> What would this metho...

This method would apply the mapFunction to each corresponding pair of elements from the two vectors and then aggregate the results using the aggregatorFunction. The unit is the unit of the aggregator and would only be needed if the vectors have no entries; we could probably do without it.

This could be a static function or a method on vectorA. Putting the method on vectorA would probably be better because it could drive many common optimizations. Examples of this pattern include sum-squared-difference (agg = plus, map = compose(sqr, minus)) and dot (agg = plus, map = times).

This can be composed with a temporary output vector, or sometimes by mutating one of the operands. That is not as desirable as just accumulating the results on the fly, however.

> The reason why we need a specialized function is to do things in a nicely
> mutating way: Hadoop M...

We definitely need that too.

> The only thing more we need than what we have now is in the assign method -
> currently we ha...

That can work, but it very often requires an extra copy of the vector, as in the distance case that Robin brought up. The contract there says neither operand can be changed, which forces a vector copy in the current API.
A mapReduce-style operation, in addition to a map, would let us avoid that copy in this important case.
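To make the pattern concrete, here is a minimal sketch of the aggregate(agg, map) idea on two vectors, accumulating on the fly with no temporary vector and no mutation of either operand. The method name, the use of plain double[] in place of a real sparse Vector, and the lambda-based functions are all illustrative assumptions, not the committed API:

```java
import java.util.function.DoubleBinaryOperator;

public class AggregateSketch {

    // Combine corresponding elements with `map`, then fold the mapped
    // values together with `agg`. The result accumulates as we scan,
    // so no temporary vector and no operand copy is needed.
    static double aggregate(double[] a, double[] b,
                            DoubleBinaryOperator agg, DoubleBinaryOperator map) {
        double result = map.applyAsDouble(a[0], b[0]);
        for (int i = 1; i < a.length; i++) {
            result = agg.applyAsDouble(result, map.applyAsDouble(a[i], b[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {4, 5, 6};

        // dot product: agg = plus, map = times
        double dot = aggregate(x, y, (s, t) -> s + t, (u, v) -> u * v);

        // sum-squared-difference: agg = plus, map = compose(sqr, minus)
        double ssd = aggregate(x, y, (s, t) -> s + t,
                               (u, v) -> (u - v) * (u - v));

        System.out.println(dot); // 4 + 10 + 18 = 32.0
        System.out.println(ssd); // 9 + 9 + 9 = 27.0
    }
}
```

A real implementation would iterate only the nonzero entries of a sparse vector (taking the unit of the aggregator into account for skipped positions), which is exactly the optimization that putting the method on vectorA enables.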