Hi everyone,

Thanks so much for the feedback!
> I think in the long run, having the ability to take either symmetric input
> (assume that it is already the matrix of similarities), or the "raw" input,
> would be nice. For now, whatever is easiest for you to work with should be
> fine.

Theoretically, it is actually feasible to assume non-symmetric affinity/similarity matrices, though this equates to a non-trivial stochastic decomposition...so for the time being, I'm assuming symmetric :P

> Sticking with SequenceFile is the way to go, with the Writable type dictated
> by what it is you're doing - when they're vectors, use VectorWritable, and
> when you just need to pass around some coefficients, you can send
> IntWritable or your own custom writable.

IntWritable, DoubleWritable, and VectorWritable should serve beautifully. Just so I'm absolutely clear, though: the IntWritable key is the row index of the corresponding VectorWritable row, correct?

> So your matrix A is waaaaay too big to just fit in memory, right? So
> this code won't simply work on Big (or even Medium) Data.

That's an excellent point; these similarity matrices can be millions x millions.

> You need to write a MapReduce job which takes your
> SequenceFile<IntWritable,VectorWritable> input, and does what
> your inner loop effectively does. You can probably have a single
> Reducer which takes in all of the outputs of your Map job and builds
> up a big Vector of results - you don't need a SparseMatrix; because
> it's diagonal, it can be represented as a (Dense) Vector.

EigencutsMapper and EigencutsReducer it is, then. Thanks! Also, good point on the diagonal matrix being representable as a DenseVector.

> Again, since A is a DistributedRowMatrix, there is a better way
> to compute this: if you look at what happens when you pre-multiply
> a matrix A by a diagonal matrix D, you're taking the i'th *row* of
> matrix A and multiplying it by d_i. When you post-multiply A
> by D, you're taking the j'th *column* of A and multiplying by d_j.
>
> End result: L_ij = d_i a_ij d_j

This is very nice; thanks for the reminder.

> The meat of the algorithm is just the "solve" method. So if you're
> already set up with a (configure()'ed!) DistributedRowMatrix A which
> you want to decompose, then create a Matrix to store the
> eigenvectors, and create a List<Double> to store the eigenvalues,
> and do like Jeff said - just call:
>
> solver.solve(A, desiredRank, eVectors, eValues, true);
>
> Just make sure that you've called configure() on A before doing this.

It's mainly the configuration objects that were throwing me for a loop, but I think I've got that figured out now. Thanks!

> The rows of U will be your projections of your original similarity matrix
> onto the reduced-dimensional space (they'll be Dense!), so yeah, this makes
> sense, but I'm not sure whether you want to normalize by the inverse
> of the eigenvalues or not, first (do you want U, or S^-1 * U? - by
> definition of S, the dispersion of points in the reduced space is going
> to be highly clustered along the first few eigenvector directions if
> you don't normalize...)

I purposely left this out since I figured it wasn't relevant to my immediate questions, but you're absolutely correct that U needs to be normalized (call it V). In the thesis, the rows of U are normalized to unit length (v_ij = u_ij / sqrt(sum_j(u_ij^2))), so that's roughly equivalent to the S^-1 * U you mentioned in terms of uncoupling the points. Either approach will probably work.

> One note to add to Jeff's comment: your eigenvectors will live as the
> transpose of what you want for clustering, so you will need to instantiate
> a DistributedRowMatrix based on them (or, if they are small enough,
> just load the contents of the HDFS file into memory), and then call
> transpose(). The results of this are the thing you want to push into
> the kmeans job as input.

I was wondering about the matrix row/column orientation in terms of what kmeans operates on.
Thanks!

> Hope this helps more than confuses!

Very much helps. Thank you to you and Jeff, I really appreciate it!

Shannon
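P.S. While writing this up I also sanity-checked the diagonal pre/post-multiplication trick in plain Java. Again, DiagScale and the bare arrays are just mine for illustration; in the actual job each row i of the DistributedRowMatrix would be scaled by d_i and each entry j within it by d_j, so D never has to be materialized as a matrix.

```java
import java.util.Arrays;

// Sanity check of D * A * D for diagonal D:
//   L_ij = d_i * a_ij * d_j
// so pre-multiplying scales rows by d_i and post-multiplying scales
// columns by d_j, and the diagonal only needs to be stored as a vector.
public class DiagScale {

    public static double[][] scale(double[] d, double[][] a) {
        int n = d.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                l[i][j] = d[i] * a[i][j] * d[j];
            }
        }
        return l;
    }

    public static void main(String[] args) {
        double[] d = {2.0, 3.0};
        double[][] a = {{1.0, 1.0}, {1.0, 1.0}};
        System.out.println(Arrays.deepToString(scale(d, a))); // [[4.0, 6.0], [6.0, 9.0]]
    }
}
```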
