Passing small amounts of data via configuration is reasonable to do, but it isn't clear that this is a good idea for you. Do you really only want to pass around a single input vector for an entire map-reduce invocation? Map-reduce takes a looong time to get started.
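If you do decide to go the configuration route for a single small vector, one way that should work is to run it through VectorWritable and Base64-encode the bytes. This is a rough, untested sketch; it assumes Mahout's VectorWritable and commons-codec 1.4+ are on the classpath, and the class and key names are only illustrative:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public final class VectorConfSerde {

  private VectorConfSerde() {}

  // Serialize the vector through VectorWritable and store the bytes
  // Base64-encoded under the given configuration key.
  public static void writeVector(Configuration conf, String key, Vector v) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    new VectorWritable(v).write(out);
    out.close();
    conf.set(key, Base64.encodeBase64String(bytes.toByteArray()));
  }

  // Decode the Base64 string from the configuration and rebuild the vector.
  public static Vector readVector(Configuration conf, String key) throws IOException {
    byte[] bytes = Base64.decodeBase64(conf.get(key));
    VectorWritable writable = new VectorWritable();
    writable.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    return writable.get();
  }
}

In the driver you would call writeVector() with some key of your choosing before submitting the job, and in the mapper's setup() you would call readVector() with the same key.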
If you might want to pass many input vectors to each mapper, then the distributed cache mechanism is probably a better bet for what you want to do. Basically, what you are trying to do is isomorphic to a map-side join, and the distributed cache is the normal mechanism used for that (there is a rough sketch of the driver and mapper side at the bottom of this mail).

Also, I have written an implementation of LSMR for iteratively solving linear systems. Would that be helpful for you? I think you may have mentioned that you were looking at LSQR some time ago; LSMR is a follow-on algorithm. If you are interested, take a look at https://issues.apache.org/jira/browse/MAHOUT-499 and the git repo referenced there. As soon as the 0.4 release goes out, I will likely commit that in case you need it.

On Wed, Oct 20, 2010 at 3:29 AM, Alexander Hans <a...@ahans.de> wrote:

> Hi,
>
> I've finally got some work done on the LWLR implementation. It's already
> functional when used with fixed weights of 1, i.e., linear regression. In
> that case each mapper gets a vector from the training data and calculates
> the A matrix (X'*W*X, with W being a diagonal matrix containing the
> weights for each training vector, currently W = I) and the b vector
> (X'*W*y, again currently with W = I) for that training vector. The
> reducer then sums the individual As and bs to get the final A and b,
> which are then used to calculate the coefficient vector theta (I think it
> would be a good idea to have combiners calculate partial sums and then
> let the reducer compute the final sum from the combiners' output). It
> then loads another file containing input vectors for the prediction
> phase, constructs a matrix X from those vectors, and calculates the
> output as y = X * theta.
>
> Now for LWLR it doesn't work like that, since for each prediction input
> we need a different theta vector. So as a first step it would make sense
> to give the algorithm a set of training vectors (containing input vectors
> and target scalars) and just one prediction input vector. Then each
> mapper would do just the same as it does now, except that it would also
> calculate the weight for its training vector using the training input
> vector and the prediction input vector. Now I come to my question: how
> can I share the prediction input vector between those individual mappers?
> I don't want each mapper to have to load it from a file. I think a good
> solution would be to pass it using the configuration. In a Hadoop-related
> forum or list, someone suggested serializing the object you want to share
> to a String and then putting that String into the configuration. Do you
> think that's a good idea? If yes, what is the proper Mahout way of
> serializing a Vector to a String and deserializing from String to Vector
> later?
>
> Thanks,
>
> Alex
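For the distributed cache route mentioned above, here is a rough, untested sketch of the driver and mapper side. The SequenceFile of VectorWritables holding the prediction inputs, the IntWritable keys, and the class names are all just assumptions for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class LwlrDistributedCacheSketch {

  // Driver side: register the file of prediction inputs before submitting the job.
  public static void addPredictionInputs(Job job, Path predictionFile) throws IOException {
    DistributedCache.addCacheFile(predictionFile.toUri(), job.getConfiguration());
  }

  // Mapper side: read the cached file once in setup() and keep the vectors in memory.
  public static class SketchMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private final List<Vector> predictionInputs = new ArrayList<Vector>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      // Assumes the prediction file is the only cache file; otherwise match by name.
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      SequenceFile.Reader reader =
          new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        predictionInputs.add(value.get().clone());
      }
      reader.close();
    }

    // map() can now compute the weight for its training vector against every
    // prediction input without re-reading anything from HDFS.
  }
}

The point is that each mapper pays the cost of reading the prediction inputs exactly once in setup(), and the cached file can hold as many vectors as you like, which is what makes this look like a map-side join.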