Re: [jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Jeff Eastman Mon, 27 Sep 2010 10:24:16 -0700

 Hi Shannon,

This is very interesting stuff. I was able to download your latest patchand, with a half an hour of busy-work editing, was able to make it allcompile and all the unit tests run. I will attach my revised patch foryour inspection. Hope this helps.


Cheers,
Jeff


On 9/27/10 10:31 AM, Shannon Quinn wrote:

It is (mostly) completed as it stands, but a few updates that have been
pushed out to some of the dependencies (DistributedRowMatrix and
DistributedLanczosSolver in particular) have broken the patch I had, so I
have another that's just about ready to commit with these fixes.

Had a big machine learning assignment that was due today, so hopefully I
should be able to submit this patch this week.

Shannon

On Sun, Sep 26, 2010 at 5:59 AM, Sean Owen (JIRA)<[email protected]>  wrote:

     [
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

Sean Owen updated MAHOUT-363:
-----------------------------

        Fix Version/s: 0.5
    Affects Version/s: 0.3
          Component/s: Clustering

Shannon was this completed? Do we have a final patch to be committed?

Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
------------------------------------------------------------------

                 Key: MAHOUT-363
                 URL: https://issues.apache.org/jira/browse/MAHOUT-363
             Project: Mahout
          Issue Type: Task
          Components: Clustering
    Affects Versions: 0.3
            Reporter: Shannon Quinn
            Assignee: Shannon Quinn
             Fix For: 0.5

         Attachments: MAHOUT-363.patch, MAHOUT-363.patch,

MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch


Proposal Title: EigenCuts spectral clustering implementation on

map/reduce for Apache Mahout (addresses issue Mahout-328)

Student Name: Shannon Quinn
Student E-mail: [email protected]
Organization/Project:Assigned Mentor:
Proposal Abstract:
Clustering algorithms are advantageous when the number of classes are not

known a priori. However, most techniques still require an explicit K to be
chosen, and most spectral algorithms' use of piecewise constant
approximation of eigenvectors breaks down when the clusters are tightly
coupled. EigenCuts[1] solves both these problems by choosing an eigenvector
to create a new cluster boundary and iterating until no more edges are cut.

Detailed Description
Clustering techniques are extremely useful unsupervised methods,

particularly within my field of computational biology, for situations where
the number (and often the characteristics as well) of classes expressed in
the data are not known a priori. K-means is a classic technique which, given
some K, attempts to label data points within a cluster as a function of
their distance (e.g. Euclidean) from the cluster's mean, iterating to
convergence.

Another approach is spectral clustering, which models the data as a

weighted, undirected graph in some n-dimensional space, and creates a matrix
M of transition probabilities between nodes. By computing the eigenvalues
and eigenvectors of this matrix, most spectral clustering techniques take
advantage of the fact that, for data with loosely coupled clusters, the K
leading eigenvectors will identify the roughly piecewise constant regions in
the data that correspond to clusters.

However, these techniques all suffer from drawbacks, the two most

significant of which are having to choose an arbitrary K a priori, and in
the situation of tightly coupled clusters where the piecewise constant
approximation on the eigenvectors no longer holds.

The EigenCuts algorithm addresses both these issues. As a type of

spectral clustering algorithm it works by constructing a Markov chain
representation of the data and computing the eigenvectors and eigenvalues of
the transition matrix. Eigenflows, or flow of probability by eigenvector,
have an associated half life of flow decay called eigenflow. By perturbing
the weights between nodes, it can be observed where bottlenecks exist in the
eigenflow's halflife, allowing for the identification of boundaries between
clusters. Thus, this algorithm iterates until no more cuts between clusters
need to be made, eliminating the need for an a prior K, and conferring the
ability to separate tightly coupled clusters.

The only disadvantage of EigenCuts is the need to recompute eigenvectors

and eigenvalues at each iterative step, incurring a large computational
overhead. This problem can be adequately addressed within the map/reduce
framework and on a Hadoop cluster by parallelizing the computation of each
eigenvector and its associated eigenvalue. Apache Hama in particular, with
its specializations in graph and matrix data, will be crucial in
parallelizing the computations of transition matrices and their
corresponding eigenvalues and eigenvectors at each iteration.

Since Dr Chennubhotla is currently a member of the faculty at the

University of Pittsburgh, I have been in contact with him for the past few
weeks, and we both envision and eagerly anticipate continued collaboration
over the course of the summer and this project's implementation. His
guidance in highlighting the finer points of the underlying theory, coupled
with my experience in and knowledge of software engineering, makes this is a
project we are both extremely excited about implementing.

Timeline
At the end of each sprint, there should be a concrete, functional

deliverable. It may not do much, but what it does will work and have full
coverage accompanying unit tests.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks -

familiarization with and dev deployments of Hadoop and Mahout; reading up on
documentation, fine-tuning the project plan and requirements. This part will
kick into high gear after May 6 (my last final exam and final academic
obligation, prior to the actual graduation ceremony), but likely nothing
before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering

algorithm and parallelize on Mahout over Hadoop. Preliminary interface
allows for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral

clustering. Integrate map/reduce framework via Mahout and take advantage of
existing core calculation of transition matrices and associated eigenvectors
and eigenvalues.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to

EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with

EigenCuts. Finalize interface for running the algorithm, selecting datasets
and visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit

tests, fix outstanding bugs.

Other Information
I am finishing up my last semester as a master's student in computational

biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at
Carnegie Mellon this coming fall. I have worked extensively with clustering
techniques as a master's student, as a significant amount of my work has
involved bioimage analysis within Dr Robert Murphy's lab. Previous work has
involved using HMMs to detect patterns in tuberculosis genomes and use
CLUSTALW to cluster those patterns according to geographic location. My
master's thesis involves use of matrix completion to infer linear
transformations of proteins and their associated subcellular locations
across varying cell conditions (drugs, cell lines, etc).

I unfortunately have limited experience with Apache Mahout and Hadoop,

but with an undergraduate computer science degree from Georgia Tech, and
after an internship with IBM ExtremeBlue, I feel I am extremely adept at
picking up new frameworks quickly.

References
[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for

Spectral Clustering. NIPS 2002.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Reply via email to