Re: Profiling SequentialAccessSparseVector

Ted Dunning Thu, 18 Feb 2010 17:11:13 -0800

On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix <jake.man...@gmail.com> wrote:


> What would this method mean?  aggregatorUnit means what?  What would this
> be a method on?
>

This method would apply the mapFunction to each corresponding pair of
elements from the two vectors and then aggregate the results using the
aggregatorFunction.

The unit is the unit of the aggregator and would only be needed if the
vectors have no entries.  We could probably do without it.
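
As a rough illustration of the method described above, here is a minimal sketch over plain double arrays. The class name, the method name aggregate, and the parameter order are assumptions made for this example only; none of this is an existing Mahout API. The unit is returned only when there are no entries to combine.

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch, not Mahout code: dense arrays stand in for vectors.
final class AggregateSketch {
  private AggregateSketch() {}

  // Applies map to each corresponding pair (a[i], b[i]) and folds the
  // mapped values together with aggregator. aggregatorUnit is used only
  // when the vectors have no entries at all.
  static double aggregate(double[] a, double[] b,
                          DoubleBinaryOperator aggregator,
                          DoubleBinaryOperator map,
                          double aggregatorUnit) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same size");
    }
    if (a.length == 0) {
      return aggregatorUnit;
    }
    double result = map.applyAsDouble(a[0], b[0]);
    for (int i = 1; i < a.length; i++) {
      result = aggregator.applyAsDouble(result, map.applyAsDouble(a[i], b[i]));
    }
    return result;
  }
}
```

Nothing is allocated and neither input is modified, which is the point of folding the results as they are produced.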

This could be a static function or could be a method on vectorA.  Putting
the method on vectorA would probably be better because it could drive many
common optimizations.

Examples of this pattern include sum-squared-difference (agg = plus, map =
compose(sqr, minus)) and dot (agg = plus, map = times).
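
The two examples above can be spelled out as follows. This is only a sketch: PLUS, MINUS, TIMES, SQR, compose and aggregate are hypothetical helpers defined here for illustration, and plain arrays stand in for vectors.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch, not Mahout code.
final class AggregateExamples {
  static final DoubleBinaryOperator PLUS = (x, y) -> x + y;
  static final DoubleBinaryOperator MINUS = (x, y) -> x - y;
  static final DoubleBinaryOperator TIMES = (x, y) -> x * y;
  static final DoubleUnaryOperator SQR = x -> x * x;

  private AggregateExamples() {}

  // compose(outer, inner) yields (x, y) -> outer(inner(x, y)).
  static DoubleBinaryOperator compose(DoubleUnaryOperator outer,
                                      DoubleBinaryOperator inner) {
    return (x, y) -> outer.applyAsDouble(inner.applyAsDouble(x, y));
  }

  // Maps each pair of entries and folds the results; unit is returned
  // only for empty input.
  static double aggregate(double[] a, double[] b, DoubleBinaryOperator agg,
                          DoubleBinaryOperator map, double unit) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same size");
    }
    if (a.length == 0) {
      return unit;
    }
    double result = map.applyAsDouble(a[0], b[0]);
    for (int i = 1; i < a.length; i++) {
      result = agg.applyAsDouble(result, map.applyAsDouble(a[i], b[i]));
    }
    return result;
  }

  // sum-squared-difference: agg = plus, map = compose(sqr, minus)
  static double sumSquaredDifference(double[] a, double[] b) {
    return aggregate(a, b, PLUS, compose(SQR, MINUS), 0.0);
  }

  // dot: agg = plus, map = times
  static double dot(double[] a, double[] b) {
    return aggregate(a, b, PLUS, TIMES, 0.0);
  }

  public static void main(String[] args) {
    double[] a = {1, 2, 3};
    double[] b = {4, 5, 6};
    System.out.println(sumSquaredDifference(a, b)); // prints 27.0
    System.out.println(dot(a, b));                  // prints 32.0
  }
}
```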

The same result can be had by writing the mapped values into a temporary
output vector, or sometimes by mutating one of the operands, and then
reducing.  Neither is as desirable as just accumulating the results on the
fly, however.

> The reason why we need a specialized function is to do things in a nicely
> mutating way: Hadoop M/R is functional in the lispy sense: read-only
> immutable objects (once on the filesystem).
>

We definitely need that too.


> The only thing more we need than what we have now is in the assign method -
> currently we have it with a map, with reduce being the identity (with
> replacement - the calling object becomes the output of the reduce, i.e. the
> output of the map):
>

That can work, but very often requires an extra copy of the vector, as in the
distance case that Robin brought up.  The contract there says neither
operand can be changed, which forces a vector copy in the current API.  A
mapReduce operation in addition to a map would allow us to avoid the copy in
that important case.
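
To make the copy concrete, here is a small sketch contrasting the two ways of computing a squared Euclidean distance without touching either operand. Both methods and the class are hypothetical and use plain arrays; the first only mimics an assign-style API by mutating a clone, and is not the actual Mahout code path.

```java
// Hypothetical sketch, not Mahout code.
final class DistanceSketch {
  private DistanceSketch() {}

  private static void checkSizes(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same size");
    }
  }

  // Map-only style: neither operand may change, so a is copied first and
  // the copy is overwritten with the differences before summing squares.
  static double squaredDistanceWithCopy(double[] a, double[] b) {
    checkSizes(a, b);
    double[] diff = a.clone();            // the extra vector copy
    for (int i = 0; i < diff.length; i++) {
      diff[i] -= b[i];                    // in-place "assign" on the copy
    }
    double sum = 0.0;
    for (double d : diff) {
      sum += d * d;
    }
    return sum;
  }

  // Map-and-reduce in a single pass: no temporary vector is allocated.
  static double squaredDistanceOnTheFly(double[] a, double[] b) {
    checkSizes(a, b);
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

Both methods return the same value and leave a and b unchanged; only the first pays for a temporary vector the size of a.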
