That runs the risk of forcing a replication of the canopy structure. These techniques: http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
might be more useful.

On Mon, Aug 16, 2010 at 7:12 AM, Robin Anil <[email protected]> wrote:

> > In this case, the document needs to be paired with the nearest cluster,
> > right? Something like Canopy clustering should give a partial connection
> > graph.
>
> Just populate similarity values for documents in a canopy: a very sparse but
> still connected graph, due to the overlapping nature of canopy clustering.
>
> Robin
>
> >> On Mon, Aug 16, 2010 at 7:00 AM, Robin Anil <[email protected]> wrote:
> >>
> >> > From a GSOC angle, it needn't be done; it's up to your mentor to decide.
> >> > I am more interested in getting this completed and pushed out so that
> >> > people can really use it. If you can spare time after GSOC and still
> >> > hang around the community and help in getting this polished, it will
> >> > be great.
> >> >
> >> > To create your pairwise similarity matrix (values 0-1, where 1 means
> >> > dissimilar; it can be the other way around as well), see the
> >> > DistanceMeasure implementations. Creating the pairwise matrix is
> >> > nontrivial from a scalability standpoint.
> >> >
> >> > A complete spectral clustering package should take an input set of
> >> > documents, create the matrix, run clustering, and output the clusters.
> >> > To get an idea of your work till now, what are the blocks missing from
> >> > this ideal package scenario?
> >> >
> >> > Robin
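
For reference, the canopy proposal quoted above (only populate similarity values for documents that share a canopy) could be sketched roughly as follows. This is a minimal single-machine illustration in plain Python, not Mahout code and not a MapReduce job. The function names `canopies` and `sparse_similarity`, the thresholds `t1` and `t2`, and the use of cosine distance are all assumptions made for the example; any distance in [0, 1] where 1 means dissimilar would do.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance; in [0, 1] for non-negative vectors, 1 = dissimilar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (na * nb)


def canopies(points, t1, t2, dist=cosine_distance):
    """Canopy clustering with loose threshold t1 and tight threshold t2.

    Returns a list of canopies, each a sorted list of point indices.
    Membership is tested against all points (an assumption of this sketch),
    so a point may belong to several canopies.
    """
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")
    candidates = list(range(len(points)))  # points still eligible as centers
    result = []
    while candidates:
        center = candidates[0]
        d = [dist(points[center], p) for p in points]
        # Everything within the loose threshold joins this canopy.
        result.append([i for i in range(len(points))
                       if i == center or d[i] < t1])
        # Everything within the tight threshold can no longer be a center.
        candidates = [i for i in candidates if i != center and d[i] >= t2]
    return result


def sparse_similarity(points, t1, t2, dist=cosine_distance):
    """Similarity (1 - distance) only for pairs that share a canopy.

    Returns a dict keyed by (i, j) with i < j; absent pairs are implicit zeros.
    """
    sim = {}
    for canopy in canopies(points, t1, t2, dist):
        for i, j in combinations(canopy, 2):
            if (i, j) not in sim:
                sim[(i, j)] = 1.0 - dist(points[i], points[j])
    return sim


# Toy data: five 2-D "documents" fanning out from the x-axis to the y-axis.
points = [(1.0, 0.0), (0.9, 0.1), (0.7, 0.7), (0.1, 0.9), (0.0, 1.0)]
sim = sparse_similarity(points, t1=0.25, t2=0.1)
for pair in sorted(sim):
    print(pair, round(sim[pair], 3))
```

On this toy input the overlapping canopies link neighbouring documents while leaving the most distant pair, (0, 4), out of the matrix. Whether the resulting graph is connected in general depends on the data and on the choice of thresholds.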
