Nicely put. So what you have is essentially a fancy nearest neighbor approach which requires that you scan all of your input unless you have a clever way of discarding those rows of X which have very small weights.
It still makes lots of sense to amortize the cost of this scan across multiple x_query values, so I think that the distributed cache is ultimately a much better idea. The output of each map would be one A_i and b_i for each input vector in the distributed cache. The key can be the serial number or offset of the input vector in the distributed cache so that the partials all go to the same reducer.

One point that I don't quite see is how you accumulate partial sums of inv(X' * X). It seems that the inverse makes this difficult. This is orthogonal to the map-reduce architecture. How are you doing this?

On Thu, Oct 21, 2010 at 7:58 AM, Alexander Hans <a...@ahans.de> wrote:

> > The issue with this kind of program is that there is usually a part of the data that is bounded in size (the model) and a part of the data that is unbounded in size (the input data). The bounded portion is usually what is stored in distributed cache, even if not all mappers read all of that data.
> >
> > The unbounded part is normally then parceled out to the different data nodes.
> >
> > Why is your program the opposite? Need it be? Should it be?
>
> It isn't the opposite. The input data is distributed. I'm not sure what you mean by model here, but if that's the model to be estimated (in the case of linear regression, the coefficients), for the locally weighted version this doesn't make sense, since for each prediction input a different problem needs to be solved. Maybe I should explain in detail what's happening:
>
> For linear regression, we have a matrix X that contains the input vectors as rows and a column vector y that contains the dependent values (the targets). Now we want to find a set of coefficients theta such that
>
>   ||X * theta - y||^2 --> min
>
> Differentiating this w.r.t. theta and setting the result equal to zero, one gets
>
>   theta = inv(X' * X) * X' * y
>
> If one has X and y, Matlab's backslash operator does exactly that (therefore in Matlab one can do theta = X \ y). So that would give us the optimal coefficients for that problem w.r.t. the squared error.
>
> In order to transform this into a task suitable for map-reduce, the equation is split into theta = inv(A) * b, where A = X' * X and b = X' * y. X can be very large: its number of rows is the number of training samples and its number of columns is the input dimensionality. Compared to X, A is usually small, with the number of rows and columns each equal to the input dimensionality.
>
> The mappers determine partial sums of matrix A and vector b using a subset of X and y (a number of rows); let's call them A_i and b_i. The reducer then sums all matrices A_i to the final A and does the same with the b_i. It finally computes theta = inv(A) * b. In the case of linear regression, theta can be used to make multiple predictions (as X * theta), which could be done in parallel using mappers, passing theta using the distributed cache, for instance.
>
> In the case of locally weighted regression, however, the prediction input already influences the determination of theta. Suppose we're looking for the output y_query at some x_query and have examples x_0,...,x_N --> y_0,...,y_N. Again, the x_i go into a matrix X and the y_i into a column vector y. Additionally, each row of X and the corresponding entry of y are weighted by the result of a function weight(x_i, x_query). Usually we want examples close to x_query to get a higher weight than those further away.
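To make sure I follow the bookkeeping, here is roughly what I picture a single mapper accumulating for one x_query (plain Java for illustration only, no Hadoop or Mahout classes, and all of the names are mine, not from your patch). I also stuck in the small-weight cutoff I mentioned above, since rows with negligible weight contribute essentially nothing:

  // Partial sums one mapper builds for a single x_query:
  //   A_i = sum over local rows of w * x * x'   (partial X' * W * X)
  //   b_i = sum over local rows of w * y * x    (partial X' * W * y)
  class PartialSums {
    final double[][] a;   // d x d
    final double[] b;     // length d

    PartialSums(int d) {
      a = new double[d][d];
      b = new double[d];
    }

    // Gaussian kernel weight; the bandwidth tau is a free parameter.
    static double weight(double[] x, double[] xQuery, double tau) {
      double dist2 = 0.0;
      for (int k = 0; k < x.length; k++) {
        double diff = x[k] - xQuery[k];
        dist2 += diff * diff;
      }
      return Math.exp(-dist2 / (2.0 * tau * tau));
    }

    // Called once per training row seen by this mapper.
    void accumulate(double[] x, double y, double[] xQuery, double tau) {
      double w = weight(x, xQuery, tau);
      if (w < 1e-8) {
        return;   // discard rows of X with very small weights
      }
      for (int r = 0; r < x.length; r++) {
        b[r] += w * y * x[r];
        for (int c = 0; c < x.length; c++) {
          a[r][c] += w * x[r] * x[c];
        }
      }
    }

    // Reducer side: element-wise addition of the partials for one x_query.
    void add(PartialSums other) {
      for (int r = 0; r < a.length; r++) {
        b[r] += other.b[r];
        for (int c = 0; c < a.length; c++) {
          a[r][c] += other.a[r][c];
        }
      }
    }
  }

If that matches what your patch does, then the thing being summed is X' * W * X itself and the inverse only ever happens once, in the reducer, which answers my own question above.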
> If we formulate this as matrices again, we get
>
>   theta = inv(X' * W * X) * X' * W * y,
>
> where W is a diagonal matrix containing the weight of each example, i.e., W(i, i) = weight(x_i, x_query).
>
> The map-reduce formulation is almost identical, except that now the mappers first calculate weight(x_i, x_query) and then use this to determine their A_i and b_i. The task of the reducer is unchanged. The resulting theta is used to make only one prediction, namely y_query = x_query * theta.
>
> I hope that it now becomes clear why I need the prediction input in the mappers. Unless one has millions of training examples, using map-reduce here just to make one prediction probably wouldn't make sense.
>
> >> shouldn't do the inversion literally. I'm now using Colt's Algebra.solve
> >> as t = Algebra.solve(A, b).
> [...]
> > Is that code using LU decomposition or QR to do the solving?
>
> I think in my case it uses LU, but I would have to look that up. I'm not at home right now, where the computer with the code is.
>
> > [kernels]
> >> other algorithms will make use of kernels as well.
> > That sounds useful, but for now the LWLR package is a fine place for that.
>
> Alright.
>
> >> Moreover, I had to enable reading/writing of matrices using sequence files,
> >> I think I will make a separate patch for that.
> > Isn't there a MatrixWritable class for this?
>
> Yeah, but it actually doesn't do anything. There are just commented-out matrix.write(out) and matrix.load(in) statements. I replaced those with something similar to the code of VectorWritable. I think the idea there was to let the actually used matrix implementation take care of the details, but that wasn't done for Vectors, and I think it's better to have the details of input/output stream handling in the *Writable classes. Otherwise we would create dependencies on the stream classes in the Vector/Matrix classes. Also, there's no test for MatrixWritable. I think tonight or tomorrow I can submit a patch containing both a MatrixWritable that actually writes and reads and a test for that class.
>
> Cheers,
>
> Alex
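P.S. On the LU-or-QR question: if I remember the Colt source correctly, Algebra.solve factors A rather than inverting it and picks LU when A is square and QR when it has more rows than columns, so for your d x d A you would indeed get LU. The reducer's last step can then stay a single call along these lines (again just a sketch, and the names are mine):

  import cern.colt.matrix.DoubleFactory2D;
  import cern.colt.matrix.DoubleMatrix2D;
  import cern.colt.matrix.linalg.Algebra;

  public class LwlrSolveSketch {
    // sumA is the summed d x d matrix X' * W * X, sumB the summed length-d
    // vector X' * W * y; the result is theta as a d x 1 matrix.
    public static DoubleMatrix2D solveTheta(double[][] sumA, double[] sumB) {
      DoubleMatrix2D a = DoubleFactory2D.dense.make(sumA);
      DoubleMatrix2D b = DoubleFactory2D.dense.make(sumB.length, 1);
      for (int i = 0; i < sumB.length; i++) {
        b.set(i, 0, sumB[i]);
      }
      // Solves a * theta = b via the factorization, no explicit inverse.
      return Algebra.DEFAULT.solve(a, b);
    }
  }

For the locally weighted case, y_query is then just the dot product of x_query with that theta.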