Hi all,

I have a few more questions regarding the inner workings of Mahout's command
line and data types. I apologize in advance for the naivete...and the length
:P

I'm working on the spectral clustering algorithm, and inherent to that
strategy is the fact that the algorithm works on pairwise data point
comparisons, not the raw data itself. Put another way, if I want to
implement a clustering algorithm for Photoshop CS6's magic wand, the pixel
intensities constitute the raw data, but what my algorithm actually works on
is a pairwise comparison of all the pixels (often a normalized form of
Euclidean distance with some KNN sprinkled in).

There are a lot of options for how to implement this preprocessing step,
where n-dimensional data is converted to matrix-form pairwise comparisons,
or "similarities". My current strategy is for the algorithm to accept data
that has already been processed into similarity-matrix form, thereby
leaving the preprocessing to the user, but also to provide an example using
images along with a script that converts the raw data to similarity scores.
Would this be a good setup? Any other suggestions? (I'd thought of adding a
plugin architecture to allow for different implementations of this
similarity preprocessing, but I'm not sure where that would mesh with the
overall Mahout architecture, or whether there would even be time this
summer for it.)
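
To make the preprocessing concrete, here's a toy, self-contained sketch of the kind of conversion I have in mind: plain Java with a Gaussian (RBF) kernel on Euclidean distance. This isn't any existing Mahout API — the class name and the sigma bandwidth parameter are purely illustrative:

```java
// Illustrative sketch only (not Mahout API): build an n x n affinity
// matrix from raw data points using a Gaussian (RBF) kernel on
// Euclidean distance. sigma is a user-chosen bandwidth parameter.
public class AffinitySketch {
  public static double[][] affinities(double[][] points, double sigma) {
    int n = points.length;
    double[][] a = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        // squared Euclidean distance between points i and j
        double d2 = 0.0;
        for (int k = 0; k < points[i].length; k++) {
          double diff = points[i][k] - points[j][k];
          d2 += diff * diff;
        }
        // Gaussian similarity: 1 on the diagonal, decaying with distance
        a[i][j] = Math.exp(-d2 / (2.0 * sigma * sigma));
      }
    }
    return a;
  }
}
```

The resulting matrix is symmetric with ones on the diagonal, which is exactly the form the algorithm would then accept as input.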

Also, input data types: I've been using LDA and k-means as templates for my
work, and they make heavy use of SequenceFiles as their input format.
However, I'm unsure how to further manipulate the data after running
seq2sparse on a CSV input file representing an NxN symmetric similarity
matrix, since all the available examples have k-means and LDA operating on
text.

My other questions are requests for clarification on the codebase. Right
now I'm implementing a canonical k-means spectral clustering algorithm; if
you want greater detail on the theory, you can read the post on my blog (
http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/),
but the basic steps are these:

1) Convert raw data to similarity, or "affinity", scores -> matrix A (done
internally)
2) Build the diagonal degree matrix D of vertex degrees (basically, the
row sums of A).
Is there a more efficient way to do this step than the following?

Matrix D = new SparseMatrix(cardinality);
for (int i = 0; i < A.numRows(); i++) {
  // degree of vertex i = sum of the entries in row i of A
  D.set(i, i, A.getRow(i).zSum());
}

3) Construct a "normalized" Laplacian via the formula: L = D^(-1/2)AD^(-1/2)
So far I have this, which doesn't yet apply the -1/2 exponent:

Matrix L = D.times(A.times(D));

Since D is a diagonal matrix, the exponent operation simply raises each
diagonal element to the -0.5 power; I assume this involves implementing a
new UnaryFunction? Or does a method already exist (aside from manually
looping over the matrix diagonal)?
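
For concreteness, here's what I mean in plain Java (again, a toy sketch, not Mahout API): because D is diagonal, the whole normalization collapses to an elementwise formula, avoiding the two full matrix multiplies entirely:

```java
// Illustrative sketch only (not Mahout API): since D is diagonal,
// L = D^(-1/2) A D^(-1/2) reduces elementwise to
//   L[i][j] = A[i][j] / sqrt(d[i] * d[j]),
// where d[i] is the sum of row i of A. Assumes every row sum is
// positive (i.e., no completely isolated vertex).
public class LaplacianSketch {
  public static double[][] normalize(double[][] a) {
    int n = a.length;
    double[] d = new double[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        d[i] += a[i][j];  // degree of vertex i
      }
    }
    double[][] l = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        l[i][j] = a[i][j] / Math.sqrt(d[i] * d[j]);
      }
    }
    return l;
  }
}
```

Whether it's better to do that elementwise or to stay in Mahout's Matrix/UnaryFunction world is part of what I'm asking.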

4) Perform eigen-decomposition on L.
This is where DistributedLanczosSolver comes in, but I'm not sure how to
fire off that job from within another job, since its run() method requires
command-line parameters.
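
As a sanity check on what I expect out of this step, here's a toy plain-Java power iteration that finds the dominant eigenvector of a small symmetric matrix. Obviously the real job needs Lanczos on Hadoop; this is just to illustrate the goal, and the class name is made up:

```java
// Toy sketch of step 4's goal (not Mahout code): find the dominant
// eigenvector of a symmetric matrix by power iteration. The real
// implementation would use DistributedLanczosSolver on Hadoop.
public class PowerIterationSketch {
  public static double[] dominantEigenvector(double[][] m, int iters) {
    int n = m.length;
    double[] v = new double[n];
    java.util.Arrays.fill(v, 1.0 / Math.sqrt(n));  // start with a unit vector
    for (int t = 0; t < iters; t++) {
      // multiply: w = M * v
      double[] w = new double[n];
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          w[i] += m[i][j] * v[j];
        }
      }
      // renormalize to unit length
      double norm = 0.0;
      for (double x : w) norm += x * x;
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) v[i] = w[i] / norm;
    }
    return v;
  }
}
```

One thought I had: if DistributedLanczosSolver implements Hadoop's Tool interface, maybe ToolRunner.run(...) would let me hand it an argument array programmatically — but I haven't verified that's the intended usage, hence the question.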

5) Perform k-means clustering on the rows of the matrix U whose columns
are the eigenvectors of L.
Same problem as #4: how do I invoke the existing k-means job from within
my own job?
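
Again for concreteness, here's a minimal plain-Java Lloyd iteration over the rows of U — not Mahout's k-means job; the class name and signature are made up for illustration:

```java
// Minimal Lloyd's-algorithm sketch (plain Java, not Mahout's k-means
// job): cluster the rows of U, where each row is a point in R^k.
// Note: mutates the centroids array in place.
public class KMeansSketch {
  public static int[] cluster(double[][] rows, double[][] centroids, int iters) {
    int n = rows.length, k = centroids.length, dim = rows[0].length;
    int[] assign = new int[n];
    for (int t = 0; t < iters; t++) {
      // assignment step: nearest centroid by squared Euclidean distance
      for (int i = 0; i < n; i++) {
        double best = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = 0.0;
          for (int j = 0; j < dim; j++) {
            double diff = rows[i][j] - centroids[c][j];
            d += diff * diff;
          }
          if (d < best) { best = d; assign[i] = c; }
        }
      }
      // update step: recompute each centroid as its cluster's mean
      double[][] sum = new double[k][dim];
      int[] count = new int[k];
      for (int i = 0; i < n; i++) {
        count[assign[i]]++;
        for (int j = 0; j < dim; j++) sum[assign[i]][j] += rows[i][j];
      }
      for (int c = 0; c < k; c++) {
        if (count[c] > 0) {
          for (int j = 0; j < dim; j++) centroids[c][j] = sum[c][j] / count[c];
        }
      }
    }
    return assign;
  }
}
```

What I actually want, of course, is to hand those rows to Mahout's existing k-means driver rather than reimplement any of this — I just don't see the programmatic entry point.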

I know a lot of this is inexperience, so again I apologize, but I do
appreciate your patience :)

Regards,
Shannon
