It is (mostly) completed as it stands, but a few updates that have been pushed out to some of the dependencies (DistributedRowMatrix and DistributedLanczosSolver in particular) have broken the patch I had, so I have another that's just about ready to commit with these fixes.
Had a big machine learning assignment that was due today, so hopefully I should be able to submit this patch this week. Shannon On Sun, Sep 26, 2010 at 5:59 AM, Sean Owen (JIRA) <[email protected]> wrote: > > [ > https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel] > > Sean Owen updated MAHOUT-363: > ----------------------------- > > Fix Version/s: 0.5 > Affects Version/s: 0.3 > Component/s: Clustering > > Shannon was this completed? Do we have a final patch to be committed? > > > Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) > > ------------------------------------------------------------------ > > > > Key: MAHOUT-363 > > URL: https://issues.apache.org/jira/browse/MAHOUT-363 > > Project: Mahout > > Issue Type: Task > > Components: Clustering > > Affects Versions: 0.3 > > Reporter: Shannon Quinn > > Assignee: Shannon Quinn > > Fix For: 0.5 > > > > Attachments: MAHOUT-363.patch, MAHOUT-363.patch, > MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, > MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, > MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, > MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, > MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch, MAHOUT-363.patch > > > > > > Proposal Title: EigenCuts spectral clustering implementation on > map/reduce for Apache Mahout (addresses issue Mahout-328) > > Student Name: Shannon Quinn > > Student E-mail: [email protected] > > Organization/Project:Assigned Mentor: > > Proposal Abstract: > > Clustering algorithms are advantageous when the number of classes are not > known a priori. However, most techniques still require an explicit K to be > chosen, and most spectral algorithms' use of piecewise constant > approximation of eigenvectors breaks down when the clusters are tightly > coupled. EigenCuts[1] solves both these problems by choosing an eigenvector > to create a new cluster boundary and iterating until no more edges are cut. > > Detailed Description > > Clustering techniques are extremely useful unsupervised methods, > particularly within my field of computational biology, for situations where > the number (and often the characteristics as well) of classes expressed in > the data are not known a priori. K-means is a classic technique which, given > some K, attempts to label data points within a cluster as a function of > their distance (e.g. Euclidean) from the cluster's mean, iterating to > convergence. > > Another approach is spectral clustering, which models the data as a > weighted, undirected graph in some n-dimensional space, and creates a matrix > M of transition probabilities between nodes. By computing the eigenvalues > and eigenvectors of this matrix, most spectral clustering techniques take > advantage of the fact that, for data with loosely coupled clusters, the K > leading eigenvectors will identify the roughly piecewise constant regions in > the data that correspond to clusters. > > However, these techniques all suffer from drawbacks, the two most > significant of which are having to choose an arbitrary K a priori, and in > the situation of tightly coupled clusters where the piecewise constant > approximation on the eigenvectors no longer holds. > > The EigenCuts algorithm addresses both these issues. As a type of > spectral clustering algorithm it works by constructing a Markov chain > representation of the data and computing the eigenvectors and eigenvalues of > the transition matrix. Eigenflows, or flow of probability by eigenvector, > have an associated half life of flow decay called eigenflow. By perturbing > the weights between nodes, it can be observed where bottlenecks exist in the > eigenflow's halflife, allowing for the identification of boundaries between > clusters. Thus, this algorithm iterates until no more cuts between clusters > need to be made, eliminating the need for an a prior K, and conferring the > ability to separate tightly coupled clusters. > > The only disadvantage of EigenCuts is the need to recompute eigenvectors > and eigenvalues at each iterative step, incurring a large computational > overhead. This problem can be adequately addressed within the map/reduce > framework and on a Hadoop cluster by parallelizing the computation of each > eigenvector and its associated eigenvalue. Apache Hama in particular, with > its specializations in graph and matrix data, will be crucial in > parallelizing the computations of transition matrices and their > corresponding eigenvalues and eigenvectors at each iteration. > > Since Dr Chennubhotla is currently a member of the faculty at the > University of Pittsburgh, I have been in contact with him for the past few > weeks, and we both envision and eagerly anticipate continued collaboration > over the course of the summer and this project's implementation. His > guidance in highlighting the finer points of the underlying theory, coupled > with my experience in and knowledge of software engineering, makes this is a > project we are both extremely excited about implementing. > > Timeline > > At the end of each sprint, there should be a concrete, functional > deliverable. It may not do much, but what it does will work and have full > coverage accompanying unit tests. > > Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - > familiarization with and dev deployments of Hadoop and Mahout; reading up on > documentation, fine-tuning the project plan and requirements. This part will > kick into high gear after May 6 (my last final exam and final academic > obligation, prior to the actual graduation ceremony), but likely nothing > before April 29 (the day of my thesis defense). > > Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering > algorithm and parallelize on Mahout over Hadoop. Preliminary interface > allows for dataset selection and visualization of resulting clusters. > > Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral > clustering. Integrate map/reduce framework via Mahout and take advantage of > existing core calculation of transition matrices and associated eigenvectors > and eigenvalues. > > Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to > EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations. > > Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with > EigenCuts. Finalize interface for running the algorithm, selecting datasets > and visualizing results. > > Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit > tests, fix outstanding bugs. > > Other Information > > I am finishing up my last semester as a master's student in computational > biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at > Carnegie Mellon this coming fall. I have worked extensively with clustering > techniques as a master's student, as a significant amount of my work has > involved bioimage analysis within Dr Robert Murphy's lab. Previous work has > involved using HMMs to detect patterns in tuberculosis genomes and use > CLUSTALW to cluster those patterns according to geographic location. My > master's thesis involves use of matrix completion to infer linear > transformations of proteins and their associated subcellular locations > across varying cell conditions (drugs, cell lines, etc). > > I unfortunately have limited experience with Apache Mahout and Hadoop, > but with an undergraduate computer science degree from Georgia Tech, and > after an internship with IBM ExtremeBlue, I feel I am extremely adept at > picking up new frameworks quickly. > > References > > [1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for > Spectral Clustering. NIPS 2002. > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > >
