Re: [jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Shannon Quinn Tue, 17 Aug 2010 06:07:09 -0700

> > Anything but sparse connectivity is a complete non-starter for a scalable
> > system.
>
> Right, that's why I don't like the pairwise computation approach.
>
> In this case, the document needs to be paired with the nearest cluster,
> right? Something like Canopy clustering should give a partial connection
> graph?
>
>
At least in all the reading I've done, the affinity graph is extremely
sparse; each point is often connected only to a neighborhood of 8-10 other
points (out of millions), and all other connections are set to 0 (or no
connection).
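
To make the sparsity concrete, here is a minimal sketch of a k-nearest-neighbor
affinity graph stored as a map of maps, so that only the non-zero entries take
up space. The function name knn_affinity, the Gaussian weighting, and the sigma
parameter are assumptions for illustration and are not taken from the Mahout
code. The neighbor search is brute force to keep the example short; at scale it
would have to be replaced by something that avoids comparing all pairs.

```python
import math
from collections import defaultdict


def knn_affinity(points, k=8, sigma=1.0):
    """Return a sparse symmetric affinity map {i: {j: weight}}.

    Hypothetical helper: each point is linked to its k nearest
    neighbors; every other pair is simply absent (implicitly 0).
    """
    affinity = defaultdict(dict)
    for i, p in enumerate(points):
        # Brute-force neighbor search, for illustration only.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            # Gaussian kernel weight (an assumed choice of affinity).
            w = math.exp(-d * d / (2.0 * sigma * sigma))
            affinity[i][j] = w
            affinity[j][i] = w  # keep the graph symmetric
    return affinity


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    graph = knn_affinity(pts, k=1)
    for i in sorted(graph):
        print(i, graph[i])
```

Because the links are mirrored to keep the matrix symmetric, a row can end up
with more than k entries, but the total number of stored entries still grows
with n * k rather than n * n.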



> A complete spectral clustering package should take an input set of
> documents, create the matrix and run clustering and output the clusters. To
> get an idea of your work till now, what are the blocks missing from this
> ideal package scenario?
>
>
The missing block is the creation of the matrix; right now the assumption is
that the user supplies their own affinity data, and the algorithm runs from
that point. I'd need to add a component which can handle the raw data and
create a sparse affinity matrix from it. In this specific case (the Reuters
dataset), that means a method for comparing documents.
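
As a rough sketch of what such a component could look like, the following
compares documents by cosine similarity over raw term counts and keeps only the
top k matches per document. It is not the Mahout implementation: the name
doc_affinity, the whitespace tokenizer, and the use of an inverted index to
limit comparisons to documents that share at least one term are all assumptions
made for illustration.

```python
import math
from collections import Counter, defaultdict


def doc_affinity(docs, k=10):
    """Return {doc_id: {other_id: cosine}} keeping the top k per document.

    Hypothetical helper: documents with no term in common are never
    compared, so their affinity is implicitly 0.
    """
    # Term-count vectors from a naive whitespace tokenizer (assumption).
    tfs = [Counter(text.lower().split()) for text in docs]
    norms = [math.sqrt(sum(c * c for c in tf.values())) for tf in tfs]

    # Inverted index: term -> list of (doc_id, term count).
    index = defaultdict(list)
    for i, tf in enumerate(tfs):
        for term, c in tf.items():
            index[term].append((i, c))

    affinity = {}
    for i, tf in enumerate(tfs):
        # Accumulate dot products only with documents sharing a term.
        dots = defaultdict(float)
        for term, c in tf.items():
            for j, cj in index[term]:
                if j != i:
                    dots[j] += c * cj
        scored = sorted(
            ((dot / (norms[i] * norms[j]), j) for j, dot in dots.items()),
            reverse=True,
        )
        affinity[i] = {j: s for s, j in scored[:k]}
    return affinity


if __name__ == "__main__":
    sample = ["oil prices rise", "oil prices fall", "wheat harvest grows"]
    print(doc_affinity(sample, k=2))
```

Two caveats: the top-k cut is applied per row, so the result is not guaranteed
to be symmetric and would need to be symmetrized before use as an affinity
matrix, and very common terms produce long posting lists, so some form of stop
word removal or term weighting would be needed on real text.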
