[
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120666#comment-13120666
]
Paritosh Ranjan commented on MAHOUT-825:
----------------------------------------
As Sean mentioned,
"This canopy-generation business is a heuristic speed-up. I would not expect it
to mean some things don't get clustered."
These two goals cannot both be achieved in canopy generation. Clustering
every point makes canopy generation very slow (all canopies are processed
on a single reducer in the buildCluster phase).
To address this performance problem, the number of canopies was limited
(based on their quality) by a variable, clusterFilter, introduced a week
ago. It is an int, and canopies that do not contain more than clusterFilter
points are not written out. Because fewer canopies are generated, the
reducer phase of buildCluster becomes much faster. In effect, clusterFilter
states whether the user wants every point to be inside a cluster or not
(see the code snippet):
    if (canopy.getNumPoints() > clusterFilter) {
      context.write(new Text(canopy.getIdentifier()), canopy);
    }
Setting clusterFilter > 0 implies that the user is only interested in
clusters that contain more than clusterFilter points, and not in remote,
single, or sparsely grouped points. It also means that the user does not
want every point to be clustered.
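The effect of the filter can be sketched outside Hadoop as follows. This is only an illustration: the class ClusterFilterSketch, the method retained, and the map of canopy sizes are hypothetical stand-ins and not Mahout classes. Only the comparison (numPoints > clusterFilter) is taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the reducer-side filter; not Mahout's actual classes.
final class ClusterFilterSketch {

  // Returns the ids of canopies holding strictly more than clusterFilter points,
  // mirroring the condition canopy.getNumPoints() > clusterFilter.
  static List<String> retained(Map<String, Integer> pointsPerCanopy, int clusterFilter) {
    List<String> kept = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : pointsPerCanopy.entrySet()) {
      if (entry.getValue() > clusterFilter) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Made-up canopy sizes, used only to show the behaviour of the filter.
    Map<String, Integer> sizes = new LinkedHashMap<>();
    sizes.put("C-0", 1);
    sizes.put("C-1", 3);
    sizes.put("C-2", 10);

    // clusterFilter = 0: every non-empty canopy is kept.
    System.out.println(retained(sizes, 0));
    // clusterFilter = 3: only canopies with more than 3 points are kept.
    System.out.println(retained(sizes, 3));
  }
}
```

With clusterFilter = 0 all three canopies survive. With clusterFilter = 3 only C-2 survives, so the points that were in C-0 and C-1 would be left without a canopy of their own.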
So this patch can include an additional check for (clusterFilter > 0). If
it is true, the user is not interested in clustering every point and cares
more about the quality of the clusters, and this patch will then give the
desired result.
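One possible reading of this proposal, as a minimal sketch: enforce the t1 limit when looking up the closest canopy only if clusterFilter > 0. The class ClosestCanopySketch, the method closest, the use of Euclidean distance, the inclusive t1 boundary, and the -1 return value are all assumptions made for illustration. They are not taken from the attached patch or from Mahout's API.

```java
// Hypothetical sketch of the proposed check; not the attached patch.
final class ClosestCanopySketch {

  // Euclidean distance is an assumption; the real job uses a configurable measure.
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  // Returns the index of the closest canopy center, or -1 if the point is left
  // unclustered (no centers, or t1 is enforced and the closest center is farther than t1).
  static int closest(double[] point, double[][] centers, double t1, int clusterFilter) {
    int best = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centers.length; i++) {
      double d = distance(point, centers[i]);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    // clusterFilter > 0 is read as "the user does not need every point clustered",
    // so only then are points outside t1 rejected.
    boolean enforceT1 = clusterFilter > 0;
    if (enforceT1 && bestDistance > t1) {
      return -1;
    }
    return best;
  }

  public static void main(String[] args) {
    double[] point = {0.0, 0.0};
    double[][] centers = {{10.0, 0.0}, {3.0, 4.0}}; // distances 10 and 5
    System.out.println(closest(point, centers, 4.0, 0)); // t1 not enforced
    System.out.println(closest(point, centers, 4.0, 2)); // t1 enforced
  }
}
```

Under these assumptions, a user who leaves clusterFilter at 0 still gets every point assigned to its nearest canopy, while a user who sets it above 0 never sees points outside t1 grouped into a canopy.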
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that the
> returned canopy is within distance t1 of the point. This produces
> incorrect results, i.e. points outside t1 are grouped into canopies.