Hi Shannon,

I'll take a crack at some of your questions, see below.

On 6/8/10 10:31 PM, Shannon Quinn wrote:
Hi all,

I have a few more questions regarding the inner workings of Mahout's command
line and data types. I apologize in advance for the naivete...and the length
:P

I'm working on the spectral clustering algorithm, and inherent to that
strategy is the fact that the algorithm works on pairwise data point
comparisons, not the raw data itself. Put another way, if I want to
implement a clustering algorithm for Photoshop's magic wand, the pixel
intensities constitute the raw data, but what my algorithm actually works on
is a pairwise comparison of all the pixels (often a normalized form of
Euclidean distance with some KNN sprinkled in).

This preprocessing step - where n-dimensional data is converted to
matrix-form pairwise comparisons, or "similarities" - can be implemented
in a lot of different ways. My current strategy is for the algorithm
to accept data that has already been processed into similarity matrix form
and thereby leave the preprocessing to the user, but also to provide an
example using images and a script that converts the raw data to similarity
scores. Would this be a good setup? Any other suggestions? (I'd thought of
somehow adding a plugin architecture to allow for different implementations
of performing this similarity preprocessing, but I'm not sure where that
would mesh with the overall Mahout architecture, or that there would even be
time this summer for it)
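To make the preprocessing step concrete: a common choice for turning raw points into affinity scores is a Gaussian kernel on Euclidean distance. Below is a minimal, self-contained sketch in plain Java (not Mahout's Matrix API); the class name, the sigma parameter, and the dense-array representation are all illustrative assumptions, and a real preprocessor would emit Vector sequence files instead.

```java
public class AffinityDemo {

  // Illustrative sketch: build an NxN affinity matrix from n-dimensional
  // points using a Gaussian kernel on squared Euclidean distance:
  //   A[i][j] = exp(-||xi - xj||^2 / (2 * sigma^2))
  static double[][] affinity(double[][] points, double sigma) {
    int n = points.length;
    double[][] a = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        double d2 = 0.0;
        for (int k = 0; k < points[i].length; k++) {
          double diff = points[i][k] - points[j][k];
          d2 += diff * diff;
        }
        a[i][j] = Math.exp(-d2 / (2.0 * sigma * sigma));
      }
    }
    return a;
  }

  public static void main(String[] args) {
    double[][] points = { {0, 0}, {0, 1}, {5, 5} };
    double[][] a = affinity(points, 1.0);
    // Diagonal entries are 1; nearby points score near 1, distant ones near 0.
    System.out.printf("%.4f %.4f %.4f%n", a[0][0], a[0][1], a[0][2]);
  }
}
```

The sigma bandwidth (and any KNN sparsification) would be exactly the kind of knob a plugin architecture could expose.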

Also, input data types: I've been using LDA and k-means as templates for my
work, and they make heavy use of SequenceFiles as their input format.
However, I'm unsure of how to further manipulate the data after using
seq2sparse on a CSV input file representing an NxN symmetric similarity
matrix, since k-means and LDA operate on text in all the available examples.
k-Means and LDA really operate on pure vectors. It is true that many of
our examples produce these vectors by analyzing text documents, but
neither algorithm requires textual data.

I suggest you look at examples/s/m/j/o/a/m/clustering/syntheticcontrol/kmeans/Job.
This job invokes a preprocessor (InputDriver.runJob()) on a CSV file to
convert the file into Mahout Vector sequence files suitable for input to
the clustering jobs. It then invokes CanopyDriver.runJob() on the vector
sequence files to determine the initial k clusters for input to k-Means.
Finally, the KMeansDriver.runJob() operation completes the clustering.
You can use this sort of job chaining in your own driver to orchestrate
your various processing steps.
My other questions revolve around clarifications on the codebase. Right now
I'm implementing a canonical k-means spectral clustering algorithm; if you
want greater detail on the theory, you can read the post on my blog (
http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/),
but I'll reproduce the basic steps here:

1) Convert raw data to similarity, or "affinity", scores -> matrix A (done
internally)
2) Build diagonal matrix D of degree vertices (basically sum of rows of A)
Is there a more efficient way to do this step than the following:

Matrix D = new SparseMatrix(cardinality);
for (int i = 0; i < A.numRows(); i++) {
  D.set(i, i, A.getRow(i).zSum());
}
This looks pretty reasonable to me.
3) Construct a "normalized" Laplacian via the formula: L = D^(-1/2)AD^(-1/2)
I have this:

Matrix L = D.times(A.times(D));

Since D is a diagonal matrix, the exponent operation is simply raising each
individual matrix element to the -0.5 power; I assume this involves
implementing a new UnaryFunction? Or does a method already exist (aside from
manually looping over the matrix diagonal)?
You could use a UnaryFunction and trust the Matrix code to iterate over the
diagonal efficiently using sparse vectors (it might do well; you could test
this), but it is a pretty simple loop to write too.
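One observation that may simplify steps 2 and 3: because D is diagonal, the product D^(-1/2) A D^(-1/2) collapses to scaling each entry of A, i.e. L[i][j] = A[i][j] / sqrt(d[i] * d[j]), so you never need to materialize D^(-1/2) at all. A minimal sketch in plain Java (illustrative class and method names, dense arrays rather than Mahout's Matrix, assuming a symmetric A with nonzero row sums):

```java
public class NormalizedLaplacianDemo {

  // Compute L = D^(-1/2) A D^(-1/2) where D is the diagonal matrix of
  // row sums of A. Since D is diagonal, the triple product reduces to
  // per-entry scaling: L[i][j] = A[i][j] / sqrt(d[i] * d[j]).
  static double[][] normalize(double[][] a) {
    int n = a.length;
    double[] d = new double[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        d[i] += a[i][j];            // degree = row sum (step 2)
      }
    }
    double[][] l = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        l[i][j] = a[i][j] / Math.sqrt(d[i] * d[j]);  // step 3
      }
    }
    return l;
  }

  public static void main(String[] args) {
    double[][] a = { {0, 1}, {1, 0} };
    double[][] l = normalize(a);
    System.out.printf("%.2f %.2f%n", l[0][0], l[0][1]); // 0.00 1.00
  }
}
```

In Mahout terms the same idea would be a loop (or UnaryFunction) over the degree vector followed by a per-row scaling, rather than two full matrix multiplies.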
4) Perform eigen-decomposition on L.
This is where DistributedLanczosSolver comes in, but I'm not sure how to
fire off the job from within another job since the run() method requires
command-line parameters.

The DistributedLanczosSolver.run method does require command-line parameters, and it is factored a bit differently from the clustering jobs in this regard. It is pretty simple to factor out a runJob method that can be called with pure Java arguments. I don't see why this could not be committed if it helps you use it, but I'll leave that decision to somebody who is actively working in that code.

public void runJob(Configuration originalConfig,
                   String inputPathString,
                   String outputTmpPathString,
                   int numRows,
                   int numCols,
                   boolean isSymmetric,
                   int desiredRank,
                   Matrix eigenVectors,
                   List<Double> eigenValues,
                   String outputEigenVectorPath) throws IOException {
  DistributedRowMatrix matrix =
      new DistributedRowMatrix(inputPathString, outputTmpPathString, numRows, numCols);
  matrix.configure(new JobConf(originalConfig));
  solve(matrix, desiredRank, eigenVectors, eigenValues, isSymmetric);
  serializeOutput(eigenVectors, eigenValues, outputEigenVectorPath);
}

5) Perform k-means clustering on rows of matrix U consisting of column
eigenvectors of L.
Same problem as #4 - how to invoke the existing k-means job from within my
own job.
Just invoke the runJob method, as in syntheticcontrol above. The unit tests for the clustering algorithms also all have programmatic invocation examples.
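For intuition about what KMeansDriver is doing at scale in step 5, here is the serial algorithm it distributes, applied to the rows of U, as a plain-Java sketch (Lloyd's iteration; class name, fixed iteration count, and caller-supplied initial centroids are illustrative assumptions, not Mahout's API):

```java
import java.util.Arrays;

public class KMeansSketch {

  // Minimal Lloyd's k-means: assign each row to its nearest centroid,
  // recompute centroids as cluster means, repeat. Returns the final
  // cluster assignment per row. Centroids are updated in place.
  static int[] cluster(double[][] rows, double[][] centroids, int iters) {
    int[] assign = new int[rows.length];
    for (int it = 0; it < iters; it++) {
      // Assignment step: nearest centroid by squared Euclidean distance.
      for (int i = 0; i < rows.length; i++) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
          double d = 0.0;
          for (int k = 0; k < rows[i].length; k++) {
            double diff = rows[i][k] - centroids[c][k];
            d += diff * diff;
          }
          if (d < bestD) { bestD = d; best = c; }
        }
        assign[i] = best;
      }
      // Update step: move each centroid to the mean of its members.
      for (int c = 0; c < centroids.length; c++) {
        double[] sum = new double[rows[0].length];
        int count = 0;
        for (int i = 0; i < rows.length; i++) {
          if (assign[i] == c) {
            count++;
            for (int k = 0; k < sum.length; k++) sum[k] += rows[i][k];
          }
        }
        if (count > 0) {
          for (int k = 0; k < sum.length; k++) centroids[c][k] = sum[k] / count;
        }
      }
    }
    return assign;
  }

  public static void main(String[] args) {
    // Rows of a toy spectral embedding U: two obvious groups.
    double[][] u = { {0.9, 0.1}, {0.8, 0.2}, {0.1, 0.9}, {0.2, 0.8} };
    double[][] centroids = { {1.0, 0.0}, {0.0, 1.0} };
    System.out.println(Arrays.toString(cluster(u, centroids, 5)));
    // -> [0, 0, 1, 1]
  }
}
```

The distributed version simply shards the assignment step across mappers and combines the per-cluster sums in the reducer, which is why chaining into KMeansDriver.runJob() after the eigen-decomposition fits so naturally.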
I know a lot of this is inexperience, so again I apologize, but I do
appreciate your patience :)

Regards,
Shannon
