[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Shannon Quinn (JIRA) Mon, 14 Jun 2010 23:26:54 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shannon Quinn updated MAHOUT-363:
---------------------------------

    Attachment: MAHOUT-363_1.patch

First attempt at implementing Sprint 1: generic k-means spectral clustering. 
Lots of the details are wrong; lots of comments explaining my thoughts at 
various steps. This patch is primarily for community feedback, as it compiles 
but definitely does not work correctly. Thanks.

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-363
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-363
>             Project: Mahout
>          Issue Type: Task
>            Reporter: Shannon Quinn
>         Attachments: MAHOUT-363_1.patch
>
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: [email protected]
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticipate continued collaboration over the course 
> of the summer and this project's implementation. His guidance in highlighting 
> the finer points of the underlying theory, coupled with my experience in and 
> knowledge of software engineering, makes this is a project we are both 
> extremely excited about implementing.
> Timeline
> At the end of each sprint, there should be a concrete, functional 
> deliverable. It may not do much, but what it does will work and have full 
> coverage accompanying unit tests.
> Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - 
> familiarization with and dev deployments of Hadoop and Mahout; reading up on 
> documentation, fine-tuning the project plan and requirements. This part will 
> kick into high gear after May 6 (my last final exam and final academic 
> obligation, prior to the actual graduation ceremony), but likely nothing 
> before April 29 (the day of my thesis defense).
> Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering 
> algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows 
> for dataset selection and visualization of resulting clusters.
> Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral 
> clustering. Integrate map/reduce framework via Mahout and take advantage of 
> existing core calculation of transition matrices and associated eigenvectors 
> and eigenvalues.
> Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to 
> EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.
> Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with 
> EigenCuts. Finalize interface for running the algorithm, selecting datasets 
> and visualizing results.
> Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit 
> tests, fix outstanding bugs.
> Other Information
> I am finishing up my last semester as a master's student in computational 
> biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at 
> Carnegie Mellon this coming fall. I have worked extensively with clustering 
> techniques as a master's student, as a significant amount of my work has 
> involved bioimage analysis within Dr Robert Murphy's lab. Previous work has 
> involved using HMMs to detect patterns in tuberculosis genomes and use 
> CLUSTALW to cluster those patterns according to geographic location. My 
> master's thesis involves use of matrix completion to infer linear 
> transformations of proteins and their associated subcellular locations across 
> varying cell conditions (drugs, cell lines, etc). 
> I unfortunately have limited experience with Apache Mahout and Hadoop, but 
> with an undergraduate computer science degree from Georgia Tech, and after an 
> internship with IBM ExtremeBlue, I feel I am extremely adept at picking up 
> new frameworks quickly.
> References
> [1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for 
> Spectral Clustering. NIPS 2002.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Reply via email to