Jeff,

I'm an old dog who has been taught a certain number of new machine-learning
tricks. There's a common thread to the Canopy and KMeans code that has me
doing a certain amount of head-scratching.

The Canopy class doesn't keep a reference to the points in the canopy. But
someone must. Or is canopy membership retested in emitPointsToNewCanopies?

--benson


On Sat, May 30, 2009 at 8:22 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> I think you are actually correct about the reference implementation that is
> used in the tests and that example. I was looking at the
> Canopy.addPointToCanopies() method which does add a new canopy if there are
> none that are strongly bound (<t2). Good eyes. Do you want to suggest a fix?
>
> Jeff
>
>
>
> Benson Margulies wrote:
>
>> I'll look at the copy in DisplayKMeans again and see if it is missing that
>> last test.
>>
>> On Sat, May 30, 2009 at 12:41 PM, Jeff Eastman
>> <j...@windwardsolutions.com> wrote:
>>
>>
>>
>>> Canopy tests each point against the current set of canopies, adding the
>>> point to each canopy that is within t1 and finally stopping when it finds
>>> one within t2. If all canopies are tested and none are within t2 then a
>>> new
>>> canopy is added with the point as its center. So, even if you set t1 and
>>> t2
>>> badly, the worst you will see is one canopy for each point in the data
>>> set.
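[A rough sketch of the assignment loop Jeff describes above. The names and the 1-D distance metric are hypothetical, for illustration only; this is not the actual Mahout Canopy code.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of canopy assignment as described above: a point
// is associated with every canopy within t1; if no canopy holds it
// within t2, the point seeds a new canopy. Worst case (bad t1/t2):
// one canopy per point, but the pass still terminates.
public class CanopySketch {
    static final double T1 = 3.0; // loose threshold (illustrative value)
    static final double T2 = 1.0; // tight threshold, T2 < T1

    static double distance(double a, double b) {
        return Math.abs(a - b); // cheap 1-D metric, for illustration
    }

    static void addPointToCanopies(double point, List<Double> centers) {
        boolean stronglyBound = false;
        for (double center : centers) {
            double d = distance(point, center);
            // d < T1: the point would also be added to this canopy's members
            if (d < T2) stronglyBound = true; // inside t2: no new canopy needed
        }
        if (!stronglyBound) centers.add(point); // point becomes a new center
    }

    public static void main(String[] args) {
        List<Double> centers = new ArrayList<>();
        double[] points = {0.0, 0.5, 4.0, 4.2, 9.0};
        for (double p : points) addPointToCanopies(p, centers);
        System.out.println(centers); // -> [0.0, 4.0, 9.0]
    }
}
```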
>>>
>>>
>>> Benson Margulies wrote:
>>>
>>>
>>>
>>>> I've looked at the implementation of Canopy in DisplayKMeans, and then
>>>> tried
>>>> to compare it to the MapReduce version.
>>>>
>>>> I'm sure that the simple version in DisplayKMeans has the potential to
>>>> loop
>>>> indefinitely. I can't prove that the problem exists in the real
>>>> map/reduce
>>>> code one way or the other.
>>>>
>>>> The 'problem' is this. If you make bad choices for t1 and t2, the
>>>> algorithm
>>>> doesn't terminate. If no points make it inside of t2, the process never
>>>> stops. Since the range of distances depends entirely on the vectors and
>>>> the
>>>> distance metric, there's no way without pre-processing to ensure that
>>>> the
>>>> values avoid a loop.
>>>>
>>>> Now, McCallum et al. don't formally present the algorithm with t1 and
>>>> t2. They describe it informally as one example of their main idea: an
>>>> initial partition with a cheap distance metric followed by detailed
>>>> clustering with an expensive one. It is possible that the
>>>> 'cross-validation' they mention, which is copied into the Mahout
>>>> comments, avoids this problem by estimating t1 and t2 from a survey of
>>>> the data.
>>>>
>>>> It might be a good idea to implement a cross validation capability, or
>>>> at
>>>> least a maximum iteration limit.
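[A minimal sketch of the iteration-limit suggestion above. The names and the cap value are assumptions, not anything from Mahout: the idea is simply that an "iterate until cooked" loop should also stop after a fixed number of passes, so bad t1/t2 choices cannot spin forever.]

```java
import java.util.function.IntPredicate;

// Hypothetical guard for an iterate-until-converged loop: stop either
// when the convergence test passes or when a hard cap is reached.
public class IterationCap {
    static final int MAX_ITERATIONS = 100; // assumed limit, for illustration

    static int runUntilConvergedOrCapped(IntPredicate converged) {
        int i = 0;
        while (i < MAX_ITERATIONS && !converged.test(i)) i++;
        return i; // number of iterations actually run
    }

    public static void main(String[] args) {
        // A test that never converges: the cap stops the loop anyway.
        System.out.println(runUntilConvergedOrCapped(i -> false)); // -> 100
        // A test that converges at pass 5 stops early.
        System.out.println(runUntilConvergedOrCapped(i -> i >= 5)); // -> 5
    }
}
```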
>>>>
>>>> All this is conditional on the notion that the MapReduce version in the
>>>> end has the same 'iterate until cooked' structure. The code in
>>>> addPointToCanopies looks to have the same potential, as it could fail to
>>>> strongly bind enough points and so never run out of things to do. Or
>>>> not. I expect to learn that I don't understand this well enough yet.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
