Hi everyone,

Thanks so much for the feedback!
> I think in the long run, having the ability to take either symmetric input
> (assume that it is already the matrix of similarities), or the "raw" input,
> would be nice. For now, whatever is easiest for you to work with should be
> fine.

Theoretically, it is actually feasible to assume non-symmetric affinity/similarity matrices, though this equates to a non-trivial stochastic decomposition...so for the time being, I'm assuming symmetric :P

> Sticking with SequenceFile is the way to go, with the Writable type dictated
> by what it is you're doing - when they're vectors, use VectorWritable, and
> when you just need to pass around some coefficients, you can send
> IntWritable or your own custom writable.

IntWritable, DoubleWritable, and VectorWritable should serve beautifully. Just so I'm absolutely clear, though: the IntWritable key is the row index of the corresponding VectorWritable row, correct?

> So your matrix A is waaaaay too big to just fit in memory, right? So
> this code won't simply work on Big (or even Medium) Data.

That's an excellent point; these similarity matrices can be millions x millions.

> You need to write a MapReduce job which takes your
> SequenceFile<IntWritable,VectorWritable> input, and does what
> your inner loop effectively does. You can probably have a single
> Reducer which takes in all of the outputs of your Map job and builds
> up a big Vector of results - you don't need a SparseMatrix; because
> it's diagonal, it can be represented as a (Dense) Vector.

EigencutsMapper and EigencutsReducer it is, then. Thanks! Also, good point on the diagonal matrix being representable as a DenseVector.

> Again, since A is a DistributedRowMatrix, there is a better way
> to compute this: if you look at what happens when you pre-multiply
> a matrix A by a diagonal matrix D, you're taking the i'th *row* of
> matrix A and multiplying it by d_i. When you post-multiply A
> by D, you're taking the j'th *column* of A and multiplying by d_j.
>
> End result: L_ij = d_i a_ij d_j

This is very nice; thanks for the reminder.

> The meat of the algorithm is just the "solve" method. So if you're
> already set up with a (configure()'ed!) DistributedRowMatrix A which
> you want to decompose, then create a Matrix to store the
> eigenvectors, and create a List<Double> to store the eigenvalues,
> and do like Jeff said - just call:
>
> solver.solve(A, desiredRank, eVectors, eValues, true);
>
> Just make sure that you've called configure() on A before doing this.

It's mainly the configuration objects that were throwing me for a loop, but I think I've got that figured out now. Thanks!

> The rows of U will be your projections of your original similarity matrix
> onto the reduced-dimensional space (they'll be Dense!), so yeah, this makes
> sense, but I'm not sure whether you want to normalize by the inverse
> of the eigenvalues or not, first (do you want U, or S^-1 * U? - by
> definition of S, the dispersion of points in the reduced space is going
> to be highly clustered along the first few eigenvector directions if
> you don't normalize...)

I purposely left this out since I figured it wasn't relevant to my immediate questions, but you're absolutely correct that U needs to be normalized (call it V). In the thesis, the rows of U are normalized to unit length (v_ij = u_ij / sqrt(sum_j(u_ij^2))), so that's roughly equivalent to the S^-1 * U you mentioned in terms of uncoupling the points. Either approach will probably work.

> One note to add to Jeff's comment: your eigenvectors will live as the
> transpose of what you want for clustering, so you will need to instantiate
> a DistributedRowMatrix based on them (or, if they are small enough,
> just load the contents of the HDFS file into memory), and then call
> transpose(). The results of this are the thing you want to push into
> the kmeans job as input.

I was wondering about the matrix row/column orientation in terms of what kmeans operates on.
Thanks!

> Hope this helps more than confuses!

Very much helps. Thank you to you and Jeff, I really appreciate it!

Shannon
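P.S. While writing this up I also sanity-checked the diagonal pre/post-multiplication trick in plain Java. Again, DiagScale and the bare arrays are just mine for illustration; in the actual job each row i of the DistributedRowMatrix would be scaled by d_i and each entry j within it by d_j, so D never has to be materialized as a matrix.

```java
import java.util.Arrays;

// Sanity check of D * A * D for diagonal D:
//   L_ij = d_i * a_ij * d_j
// so pre-multiplying scales rows by d_i and post-multiplying scales
// columns by d_j, and the diagonal only needs to be stored as a vector.
public class DiagScale {

    public static double[][] scale(double[] d, double[][] a) {
        int n = d.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                l[i][j] = d[i] * a[i][j] * d[j];
            }
        }
        return l;
    }

    public static void main(String[] args) {
        double[] d = {2.0, 3.0};
        double[][] a = {{1.0, 1.0}, {1.0, 1.0}};
        System.out.println(Arrays.deepToString(scale(d, a))); // [[4.0, 6.0], [6.0, 9.0]]
    }
}
```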
