Passing small amounts of data via the configuration is reasonable, but it
isn't clear that it is a good idea in your case.  Do you really want to pass
around just a single prediction input vector for an entire map-reduce
invocation?  Map-reduce takes a long time to get started.
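
If you do go the configuration route, the usual trick is to write the vector
with VectorWritable and base64-encode the bytes into a config property.
Something roughly like this should work (the key handling and the use of
commons-codec's Base64 are just what I would reach for, so adjust to taste):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public final class VectorConfigUtils {

  private VectorConfigUtils() {}

  // Serialize a Vector into the Configuration under the given key.
  public static void setVector(Configuration conf, String key, Vector v)
      throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    new VectorWritable(v).write(new DataOutputStream(buf));
    conf.set(key, new String(Base64.encodeBase64(buf.toByteArray()), "US-ASCII"));
  }

  // Read it back, e.g. in Mapper.setup() via context.getConfiguration().
  public static Vector getVector(Configuration conf, String key)
      throws IOException {
    byte[] bytes = Base64.decodeBase64(conf.get(key).getBytes("US-ASCII"));
    VectorWritable w = new VectorWritable();
    w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    return w.get();
  }
}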

If you might want to pass many input vectors to each mapper, then the
distributed cache mechanism is probably a better bet for what you want to
do.  What you are trying to do is essentially a map-side join, and the
distributed cache is the normal mechanism used for that.
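
The shape of that approach is roughly the following.  I'm making up the
file name and assuming the query vectors sit in a SequenceFile of
VectorWritables, so treat it as a sketch rather than working code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class LwlrMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private final List<Vector> queryVectors = new ArrayList<Vector>();

  // Driver side, before submitting the job:
  //   DistributedCache.addCacheFile(new URI("/user/alex/query-vectors.seq"), conf);

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // every mapper gets a local copy of the cached file
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
    IntWritable key = new IntWritable();
    VectorWritable value = new VectorWritable();
    while (reader.next(key, value)) {
      queryVectors.add(value.get().clone());   // one prediction input per record
    }
    reader.close();
  }

  // map() can then compute one weighted contribution per query vector.
}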

Also, I have written an implementation of LSMR for iteratively solving
linear systems.  Would that be helpful for you?  I think you mentioned some
time ago that you were looking at LSQR; LSMR is a follow-on algorithm to
that.  If you are interested, take a look at
https://issues.apache.org/jira/browse/MAHOUT-499
and the git repo referenced there.  As soon as the 0.4 release goes out, I
will likely commit it in case you need it.
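
To give you an idea of the shape of it, the solver from that patch is used
roughly like this (class and package names are as in the patch, so they may
still move around before it gets committed):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.solver.LSMR;   // from the MAHOUT-499 patch

public class LsmrExample {
  public static void main(String[] args) {
    // tiny over-determined system; solve a * x ~= b in the least-squares sense
    Matrix a = new DenseMatrix(new double[][] {{1, 1}, {1, 2}, {1, 3}});
    Vector b = new DenseVector(new double[] {1, 2, 2});
    Vector x = new LSMR().solve(a, b);
    System.out.println(x);
  }
}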

On Wed, Oct 20, 2010 at 3:29 AM, Alexander Hans <a...@ahans.de> wrote:

> Hi,
>
> I've finally got some work done on the LWLR implementation. It's already
> functional when used with fixed weights of 1, i.e., linear regression. In
> that case each mapper gets a vector from the training data and calculates
> the A matrix (X'*W*X, with W being a diagonal matrix containing the
> weights for each training vector, currently W = I) and b vector (X'*W*y,
> again currently with W = I) for that training vector. The reducer then
> sums the individual As and bs to get the final A and b, which are then
> used to calculate the coefficient vector theta (I think it would be a good
> idea to have combiners calculating partial sums and then letting the
> reducer calculate the final sum from the combiners' output). It then loads
> another file containing input vectors for the prediction phase, constructs
> a matrix X from those vectors, and calculates the output as y = X * theta.
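
For reference, the per-record accumulation and the final solve you describe
look roughly like this with the Mahout math classes (the QR decomposition at
the end is just one way to get theta out of A and b):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;
import org.apache.mahout.math.Vector;

public class NormalEquations {

  // One training record's contribution with weight w (w = 1 gives plain
  // linear regression):  A += w * x x',  b += w * y * x.
  static void accumulate(Matrix a, Vector b, Vector x, double y, double w) {
    a.assign(a.plus(x.cross(x).times(w)));
    b.assign(b.plus(x.times(w * y)));
  }

  public static void main(String[] args) {
    int n = 2;
    Matrix a = new DenseMatrix(n, n);
    Vector b = new DenseVector(n);
    accumulate(a, b, new DenseVector(new double[] {1, 1}), 1.0, 1.0);
    accumulate(a, b, new DenseVector(new double[] {1, 2}), 2.0, 1.0);
    accumulate(a, b, new DenseVector(new double[] {1, 3}), 2.0, 1.0);

    // theta = A^-1 * b, here via QR; b is wrapped as an n x 1 matrix
    Matrix bCol = new DenseMatrix(n, 1).assignColumn(0, b);
    Matrix theta = new QRDecomposition(a).solve(bCol);
    System.out.println(theta.getColumn(0));
  }
}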
>
> Now for LWLR it doesn't work like that, since for each prediction input we
> need another theta vector, so as a first step it would make sense to give
> the algorithm a set of training vectors (containing input vectors and target
> scalars) and just one prediction input vector. Then each mapper would do
> just the same as it does now, except that it would also calculate the
> weight for its training vector using the training input vector and the
> prediction input vector. Now I come to my question: How can I share the
> prediction input vector between those individual mappers? I don't want
> each mapper to have to load it from a file. I think a good solution would
> be to pass it using the configuration. In a Hadoop-related forum or list,
> someone suggested serializing the object you want to share to a String and
> then putting that String into the configuration. Do you think that's a good
> idea? If yes, what is the proper Mahout way of serializing a Vector to a
> String and deserializing from String to Vector later?
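
(On the weighting step itself: once a mapper has the prediction input in
hand, the weight is just a kernel evaluated between its training input and
that query point.  A Gaussian kernel is the usual choice for LWLR; the
bandwidth tau here is something you would have to pick yourself.)

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class LwlrWeight {

  // Gaussian kernel weight for one training input x relative to the
  // prediction (query) input q; tau is the bandwidth parameter.
  static double weight(Vector x, Vector q, double tau) {
    double d2 = x.minus(q).getLengthSquared();
    return Math.exp(-d2 / (2.0 * tau * tau));
  }

  public static void main(String[] args) {
    Vector x = new DenseVector(new double[] {1, 2});
    Vector q = new DenseVector(new double[] {1, 3});
    System.out.println(weight(x, q, 0.5));
  }
}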
>
>
> Thanks,
>
> Alex
>
>
