[ https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119400#comment-13119400 ]
Jeff Eastman commented on MAHOUT-825:
-------------------------------------
Canopy is intended to be a fast, approximate clustering algorithm. The Mahout
sequential implementation runs a single pass over the data to produce
approximate cluster centers. The MapReduce implementation runs one pass in each
mapper and another pass in the reducer to combine the results from the various
mappers. The clusters produced by the sequential and MapReduce implementations
will differ as a result.
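As a rough illustration of the two variants described above, here is a minimal Python sketch. It is not Mahout's actual code: the function names `canopy_centers` and `two_stage_centers`, the use of Euclidean distance, and the choice to measure distance against each canopy's seed point are all assumptions made for the example. It only shows the shape of a single pass with thresholds T1 > T2, and of a second pass over per-partition centers.

```python
import math


def canopy_centers(points, t1, t2):
    """Single pass over `points`; assumes t1 > t2 and Euclidean distance."""
    canopies = []  # each canopy tracks its seed point and a running sum
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = math.dist(p, c["seed"])
            if d < t1:
                # Point is loosely covered: it contributes to this center.
                c["sum"] = [s + x for s, x in zip(c["sum"], p)]
                c["n"] += 1
            if d < t2:
                # Point is tightly covered: it must not seed a new canopy.
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"seed": p, "sum": list(p), "n": 1})
    # Report each canopy center as the mean of the points it observed.
    return [tuple(s / c["n"] for s in c["sum"]) for c in canopies]


def two_stage_centers(partitions, t1, t2):
    """Hypothetical stand-in for the mapper/reducer structure."""
    # "Mapper" pass: one independent canopy pass per partition.
    local = [c for part in partitions for c in canopy_centers(part, t1, t2)]
    # "Reducer" pass: a second canopy pass over the mappers' centers.
    return canopy_centers(local, t1, t2)


points = [(0, 0), (1, 0), (10, 0), (11, 0)]
sequential = canopy_centers(points, t1=3.0, t2=2.0)
staged = two_stage_centers([points[:2], points[2:]], t1=3.0, t2=2.0)
```

Because the second stage only sees per-partition centers rather than the raw points, the two functions need not return the same centers for arbitrary data and partitioning, even where they happen to agree on a small example.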
Once cluster centers are determined, the classification (clustering) of points
follows a maximum-likelihood method which assigns each point to the closest
cluster. The proposed patch modifies that method to impose an additional
(d < T1) criterion on cluster assignment. This can leave some of the input
points unclassified. I don't view that as a step in the right direction, nor
do I think the current behavior is an incorrect result.
-1 I'm inclined to reject this patch for these reasons.
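To make the difference under discussion concrete, here is a small Python sketch of closest-center assignment with and without a (d < T1) cutoff. It is an illustration only, not the patch or Mahout's code: the function name `assign`, the optional `t1` parameter, and the Euclidean distance are assumptions.

```python
import math


def assign(points, centers, t1=None):
    """Assign each point to its closest center.

    If `t1` is given, a point whose closest center is at distance >= t1
    is left unassigned instead (the extra criterion under discussion).
    """
    assigned, unassigned = {}, []
    for p in points:
        # Index of the nearest center under Euclidean distance.
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        if t1 is not None and math.dist(p, centers[idx]) >= t1:
            unassigned.append(p)
        else:
            assigned[p] = idx
    return assigned, unassigned


centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (5.5, 0.0), (9.0, 0.0)]

closest_only = assign(points, centers)           # every point gets a cluster
with_cutoff = assign(points, centers, t1=3.0)    # (5.5, 0.0) is left out
```

With the cutoff, the point at (5.5, 0.0) is 4.5 away from its nearest center and so ends up in no cluster, which is the "not classified at all" outcome described above.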
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-outside-t1-points-patch-1
>
>
> When finding the closest canopy, there is no check to ensure that the
> returned canopy is within distance t1 of the point. This produces an
> incorrect result, i.e. points outside t1 are grouped into canopies.