> > Anything but sparse connectivity is a complete non-starter for a scalable
> > system.
>
> Right, that's why I don't like the pairwise computation approach.
>
> In this case, the document needs to be paired with the nearest cluster,
> right? Something like Canopy clustering should give a partial connection
> graph?

At least in all the reading I've done, the affinity graph is extremely sparse: each point is typically connected only to a neighborhood of 8-10 other points (out of millions), and all other connections are set to 0 (no connection).
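To make the sparsity concrete, here is a minimal sketch of a k-nearest-neighbor affinity graph. Everything in it is an assumption for illustration: the name `knn_affinity`, the Gaussian weighting, and the `sigma` parameter are hypothetical, and the neighbor search is brute force (exactly the pairwise work a scalable version would avoid, e.g. by restricting candidates with canopies first). It only shows what "each point keeps a handful of connections" looks like as a data structure.

```python
import math


def knn_affinity(points, k=2, sigma=1.0):
    """Return a sparse symmetric affinity graph as {i: {j: weight}}.

    Only the k nearest neighbors of each point get an entry; every
    other pair is implicitly 0 (no connection).
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # Brute-force neighbor search, for illustration only.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            w = math.exp(-d * d / (2.0 * sigma ** 2))  # Gaussian affinity (assumed)
            # Store both directions so the graph stays symmetric.
            graph[i][j] = w
            graph[j][i] = w
    return graph


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_affinity(points, k=2)
stored = sum(len(row) for row in graph.values())
print(stored, "stored entries out of", len(points) ** 2, "possible")
```

With a fixed k, the number of stored entries grows roughly linearly with the number of points rather than quadratically, which is the property that matters here.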

> A complete spectral clustering package should take an input set of
> documents, create the matrix, run clustering, and output the clusters. To
> get an idea of your work so far, what blocks are missing from this
> ideal package scenario?

The missing block is the creation of the matrix. Right now the assumption is that the user supplies their own affinity data, and the algorithm runs from that point. I'd need to add a component that can handle the raw data and create a sparse affinity matrix from it. In this specific case (the Reuters dataset), that means a method for comparing documents.
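As a rough idea of what that missing component could look like, here is a sketch that turns raw documents into a sparse affinity matrix. The method is my assumption, not something settled in this thread: TF-IDF weighting with cosine similarity, whitespace tokenisation, and an inverted index so that only documents sharing at least one term are ever compared. The names `tfidf_vectors` and `sparse_affinity` are hypothetical, and the four documents are toy stand-ins, not Reuters data.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector ({term: weight}) per document."""
    tokenised = [Counter(doc.lower().split()) for doc in docs]
    df = Counter(term for counts in tokenised for term in counts)
    n = len(docs)
    vectors = []
    for counts in tokenised:
        # Smoothed IDF; terms that occur in every document get weight 0.
        vec = {t: tf * math.log((1 + n) / (1 + df[t])) for t, tf in counts.items()}
        vec = {t: w for t, w in vec.items() if w > 0.0}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def sparse_affinity(docs, k=2):
    """Return {i: {j: cosine similarity}}, keeping at most k neighbors per document."""
    vectors = tfidf_vectors(docs)
    index = defaultdict(list)  # term -> [(doc_id, weight)]
    for i, vec in enumerate(vectors):
        for t, w in vec.items():
            index[t].append((i, w))
    affinity = {i: {} for i in range(len(docs))}
    for i, vec in enumerate(vectors):
        scores = defaultdict(float)
        # Accumulate dot products only with documents that share a term.
        for t, w in vec.items():
            for j, wj in index[t]:
                if j != i:
                    scores[j] += w * wj
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        for j, s in top:
            affinity[i][j] = s
            affinity[j][i] = s  # keep the matrix symmetric
    return affinity


docs = [
    "oil prices rise as opec cuts output",
    "opec output cuts push crude oil prices higher",
    "wheat harvest falls after drought",
    "drought hits wheat and corn harvest",
]
affinity = sparse_affinity(docs, k=2)
for i, row in affinity.items():
    print(i, {j: round(s, 3) for j, s in row.items()})
```

Documents with no terms in common never get an entry, so the zeros are never materialised. Very common terms would make the inverted-index lists long, so a real version would probably need stop-word removal or a document-frequency cutoff.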
