[
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121106#comment-13121106
]
Paritosh Ranjan commented on MAHOUT-825:
----------------------------------------
Yes, canopy generation and clustering are done separately, as Jeff has already
mentioned.
Of course, using clusterFilter instead of a new variable would be overusing
it. However, adding another parameter to CanopyDriver's run method, which
already takes 9-10 parameters, does not make sense. And since there is no
existing "concept" of canopy clustering as such, I don't think that user
control over this is required.
Let me explain why I think that clusterFilter and clustering with strict
canopies go hand in hand.
When a user specifies clusterFilter > 0, he has already said that he does not
want canopies with a single point (or with fewer points than clusterFilter).
So he is more interested in better, closely grouped clusters and less
interested in isolated points, and clustering remote points will not help him
in this case. (This was the exact problem where I was stuck while developing
my application, and it is why I created this patch.)
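To make the filtering step concrete, here is a minimal sketch. It is not
Mahout code: the class and method names are hypothetical, and it assumes that
a canopy is kept only when its point count is strictly greater than
clusterFilter, so that clusterFilter = 1 drops single-point canopies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Mahout code.
final class CanopyFilterSketch {

    /**
     * Returns the indices of the canopies that survive the filter.
     * Assumption: a canopy survives only if it holds more than
     * clusterFilter points.
     */
    static List<Integer> keepLargeCanopies(int[] pointCounts, int clusterFilter) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < pointCounts.length; i++) {
            if (pointCounts[i] > clusterFilter) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

With counts {1, 5, 2} and clusterFilter = 1, only the canopies at index 1 and
2 remain, which is the "no isolated points" intent described above.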
And the closestCanopy is not always that "close"; it depends on the data.
Sometimes a really distant point is assigned to a canopy that otherwise groups
good-quality, nearby points, just because it was the closest canopy to that
isolated point. This simply destroys the quality of the cluster.
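The difference between the two behaviours can be sketched as follows. This is
an illustration only, not the attached patch: the class, the method names, the
"strict" flag and the use of Euclidean distance are all assumptions made for
the example.

```java
// Hypothetical sketch, not Mahout's API and not the attached patch.
final class ClosestCanopySketch {

    /** Euclidean distance between two vectors of equal length (assumed metric). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the nearest canopy center.
     * When strict is true, returns -1 if even the nearest center is
     * farther away than t1, so the point is left unassigned.
     */
    static int assign(double[] point, double[][] centers, double t1, boolean strict) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance(point, centers[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        if (strict && bestDist > t1) {
            return -1; // nearest canopy is still outside t1
        }
        return best;
    }
}
```

With centers (0, 0) and (10, 0) and t1 = 3, the point (100, 0) goes to the
second canopy in the non-strict case, although it is 90 units away; in the
strict case it is left out.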
And the user gets this bad-quality cluster because "every clustering algorithm
does it", which, in my view, is not helping the user. Canopy Clustering is
not even a clustering algorithm; it's just a means to find the approximate
number of clusters of size T1/T2, which can be used further in K-means (all of
which is already done before the clusterData phase, so this change won't affect
the buildCluster phase). Canopy Clustering is just a utility that helps the
user get clusters easily and with high performance.
I hope I have answered all your queries.
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns
> only canopies that are within distance t1 of the point. This produces an
> incorrect result, i.e. points outside t1 are grouped into canopies.