Re: [jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Shannon Quinn Mon, 27 Sep 2010 07:32:20 -0700

It is (mostly) completed as it stands, but a few updates that have been
pushed out to some of the dependencies (DistributedRowMatrix and
DistributedLanczosSolver in particular) have broken the patch I had, so I
have another that's just about ready to commit with these fixes.


Had a big machine learning assignment that was due today, so hopefully I
should be able to submit this patch this week.

Shannon

On Sun, Sep 26, 2010 at 5:59 AM, Sean Owen (JIRA) <[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Sean Owen updated MAHOUT-363:
> -----------------------------
>
>        Fix Version/s: 0.5
>    Affects Version/s: 0.3
>          Component/s: Clustering
>
> Shannon was this completed? Do we have a final patch to be committed?
>
> > Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> > ------------------------------------------------------------------
> >
> >                 Key: MAHOUT-363
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-363
> >             Project: Mahout
> >          Issue Type: Task
> >          Components: Clustering
> >    Affects Versions: 0.3
> >            Reporter: Shannon Quinn
> >            Assignee: Shannon Quinn
> >             Fix For: 0.5
> >
> >         Attachments: MAHOUT-363.patch, MAHOUT-363.patch,
> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch,
> MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch
> >
> >
> > Proposal Title: EigenCuts spectral clustering implementation on
> map/reduce for Apache Mahout (addresses issue Mahout-328)
> > Student Name: Shannon Quinn
> > Student E-mail: [email protected]
> > Organization/Project:Assigned Mentor:
> > Proposal Abstract:
> > Clustering algorithms are advantageous when the number of classes are not
> known a priori. However, most techniques still require an explicit K to be
> chosen, and most spectral algorithms' use of piecewise constant
> approximation of eigenvectors breaks down when the clusters are tightly
> coupled. EigenCuts[1] solves both these problems by choosing an eigenvector
> to create a new cluster boundary and iterating until no more edges are cut.
> > Detailed Description
> > Clustering techniques are extremely useful unsupervised methods,
> particularly within my field of computational biology, for situations where
> the number (and often the characteristics as well) of classes expressed in
> the data are not known a priori. K-means is a classic technique which, given
> some K, attempts to label data points within a cluster as a function of
> their distance (e.g. Euclidean) from the cluster's mean, iterating to
> convergence.
> > Another approach is spectral clustering, which models the data as a
> weighted, undirected graph in some n-dimensional space, and creates a matrix
> M of transition probabilities between nodes. By computing the eigenvalues
> and eigenvectors of this matrix, most spectral clustering techniques take
> advantage of the fact that, for data with loosely coupled clusters, the K
> leading eigenvectors will identify the roughly piecewise constant regions in
> the data that correspond to clusters.
> > However, these techniques all suffer from drawbacks, the two most
> significant of which are having to choose an arbitrary K a priori, and in
> the situation of tightly coupled clusters where the piecewise constant
> approximation on the eigenvectors no longer holds.
> > The EigenCuts algorithm addresses both these issues. As a type of
> spectral clustering algorithm it works by constructing a Markov chain
> representation of the data and computing the eigenvectors and eigenvalues of
> the transition matrix. Eigenflows, or flow of probability by eigenvector,
> have an associated half life of flow decay called eigenflow. By perturbing
> the weights between nodes, it can be observed where bottlenecks exist in the
> eigenflow's halflife, allowing for the identification of boundaries between
> clusters. Thus, this algorithm iterates until no more cuts between clusters
> need to be made, eliminating the need for an a prior K, and conferring the
> ability to separate tightly coupled clusters.
> > The only disadvantage of EigenCuts is the need to recompute eigenvectors
> and eigenvalues at each iterative step, incurring a large computational
> overhead. This problem can be adequately addressed within the map/reduce
> framework and on a Hadoop cluster by parallelizing the computation of each
> eigenvector and its associated eigenvalue. Apache Hama in particular, with
> its specializations in graph and matrix data, will be crucial in
> parallelizing the computations of transition matrices and their
> corresponding eigenvalues and eigenvectors at each iteration.
> > Since Dr Chennubhotla is currently a member of the faculty at the
> University of Pittsburgh, I have been in contact with him for the past few
> weeks, and we both envision and eagerly anticipate continued collaboration
> over the course of the summer and this project's implementation. His
> guidance in highlighting the finer points of the underlying theory, coupled
> with my experience in and knowledge of software engineering, makes this is a
> project we are both extremely excited about implementing.
> > Timeline
> > At the end of each sprint, there should be a concrete, functional
> deliverable. It may not do much, but what it does will work and have full
> coverage accompanying unit tests.
> > Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks -
> familiarization with and dev deployments of Hadoop and Mahout; reading up on
> documentation, fine-tuning the project plan and requirements. This part will
> kick into high gear after May 6 (my last final exam and final academic
> obligation, prior to the actual graduation ceremony), but likely nothing
> before April 29 (the day of my thesis defense).
> > Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering
> algorithm and parallelize on Mahout over Hadoop. Preliminary interface
> allows for dataset selection and visualization of resulting clusters.
> > Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral
> clustering. Integrate map/reduce framework via Mahout and take advantage of
> existing core calculation of transition matrices and associated eigenvectors
> and eigenvalues.
> > Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to
> EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.
> > Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with
> EigenCuts. Finalize interface for running the algorithm, selecting datasets
> and visualizing results.
> > Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit
> tests, fix outstanding bugs.
> > Other Information
> > I am finishing up my last semester as a master's student in computational
> biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at
> Carnegie Mellon this coming fall. I have worked extensively with clustering
> techniques as a master's student, as a significant amount of my work has
> involved bioimage analysis within Dr Robert Murphy's lab. Previous work has
> involved using HMMs to detect patterns in tuberculosis genomes and use
> CLUSTALW to cluster those patterns according to geographic location. My
> master's thesis involves use of matrix completion to infer linear
> transformations of proteins and their associated subcellular locations
> across varying cell conditions (drugs, cell lines, etc).
> > I unfortunately have limited experience with Apache Mahout and Hadoop,
> but with an undergraduate computer science degree from Georgia Tech, and
> after an internship with IBM ExtremeBlue, I feel I am extremely adept at
> picking up new frameworks quickly.
> > References
> > [1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for
> Spectral Clustering. NIPS 2002.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Re: [jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Reply via email to