Jeff, I'm an old dog who has been taught a certain number of new machine learning tricks. There's a common thread in the Canopy and KMeans code that has me doing a certain amount of head-scratching.
The Canopy class doesn't keep a reference to the points in the canopy. But someone must. Or is canopy membership retested in emitPointsToNewCanopies?

--benson

On Sat, May 30, 2009 at 8:22 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> I think you are actually correct about the reference implementation that is
> used in the tests and that example. I was looking at the
> Canopy.addPointToCanopies() method, which does add a new canopy if there are
> none that are strongly bound (< t2). Good eyes; do you want to suggest a fix?
>
> Jeff
>
> Benson Margulies wrote:
>
>> I'll look at the copy in DisplayKMeans again and see if it is missing that
>> last test.
>>
>> On Sat, May 30, 2009 at 12:41 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:
>>
>>> Canopy tests each point against the current set of canopies, adding the
>>> point to each canopy that is within t1 and finally stopping when it finds
>>> one within t2. If all canopies are tested and none are within t2, then a
>>> new canopy is added with the point as its center. So, even if you set t1
>>> and t2 badly, the worst you will see is one canopy for each point in the
>>> data set.
>>>
>>> Benson Margulies wrote:
>>>
>>>> I've looked at the implementation of Canopy in DisplayKMeans, and then
>>>> tried to compare it to the MapReduce version.
>>>>
>>>> I'm sure that the simple version in DisplayKMeans has the potential to
>>>> loop indefinitely. I can't prove that the problem exists in the real
>>>> map/reduce code one way or the other.
>>>>
>>>> The 'problem' is this. If you make bad choices for t1 and t2, the
>>>> algorithm doesn't terminate. If no points make it inside of t2, the
>>>> process never stops. Since the range of distances depends entirely on
>>>> the vectors and the distance metric, there's no way without
>>>> pre-processing to ensure that the values avoid a loop.
>>>>
>>>> Now, McCallum et al. don't formally present the algorithm with t1 and
>>>> t2. They describe it informally as one example of their main idea:
>>>> initial partitioning with a cheap distance metric followed by detailed
>>>> clustering using an expensive one. It is possible that the
>>>> 'cross-validate' that they mention, and which is copied into the
>>>> comments in Mahout, avoids this problem by estimating t1 and t2 from a
>>>> survey of the data.
>>>>
>>>> It might be a good idea to implement a cross-validation capability, or
>>>> at least a maximum iteration limit.
>>>>
>>>> All this is conditional on the notion that the MapReduce version in the
>>>> end has the same 'iterate until cooked' structure. The code in
>>>> addPointToCanopies looks to have the same potential, as it could fail to
>>>> strongly bind enough points to run out of things to do. Or not. I expect
>>>> to learn that I don't understand this well enough yet.
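For anyone following along, here is a minimal sketch of the single-pass assignment rule Jeff describes (hypothetical names; this is not the actual Mahout Canopy class, and it uses 1-D points with absolute difference as the cheap distance metric). The key property is that a point with no canopy within t2 always seeds a new canopy, so a single pass over N points terminates with at most N canopies, no matter how badly t1 and t2 are chosen:

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {
    // Illustrative canopy: a center plus the points loosely bound to it.
    static class Canopy {
        final double center;                        // 1-D point for simplicity
        final List<Double> points = new ArrayList<>();
        Canopy(double center) { this.center = center; points.add(center); }
    }

    // Sketch of the assignment rule: add the point to every canopy within t1;
    // if no canopy is within t2, the point becomes the center of a new canopy.
    static void addPointToCanopies(double point, List<Canopy> canopies,
                                   double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy c : canopies) {
            double d = Math.abs(point - c.center);  // cheap distance metric
            if (d < t1) c.points.add(point);        // loosely bound
            if (d < t2) stronglyBound = true;       // strongly bound: no new canopy
        }
        if (!stronglyBound) canopies.add(new Canopy(point));
    }

    public static void main(String[] args) {
        List<Canopy> canopies = new ArrayList<>();
        double[] data = {0.0, 0.5, 10.0};
        for (double p : data) addPointToCanopies(p, canopies, 3.0, 1.0);
        // 0.0 seeds a canopy; 0.5 is within t2 of it; 10.0 seeds another.
        System.out.println(canopies.size());        // prints 2

        // Pathological thresholds (t2 = 0): nothing is ever strongly bound,
        // so every point seeds its own canopy -- the worst case is one canopy
        // per point, which is still finite. The single pass cannot loop.
        List<Canopy> worst = new ArrayList<>();
        for (double p : data) addPointToCanopies(p, worst, 0.0, 0.0);
        System.out.println(worst.size());           // prints 3
    }
}
```

This termination argument applies to the single batch pass; Benson's concern about an outer 'iterate until cooked' loop in DisplayKMeans is a separate question.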