[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121106#comment-13121106
 ] 

Paritosh Ranjan commented on MAHOUT-825:
----------------------------------------

Yes, canopy generation and clustering are done separately, as Jeff has already 
mentioned.

Of course, using clusterFilter instead of a new variable would be an overuse of 
that parameter. However, adding yet another parameter to the run method of 
CanopyDriver, which already takes 9-10 parameters, would not make sense. And 
since there is no existing "concept" of canopy clustering as such, I don't 
think that user control over this is required.

Let me explain why I think that clusterFilter and clustering with strict 
canopies go hand in hand.

When a user specifies clusterFilter > 0, he has already said that he does not 
want canopies containing a single point (or fewer points than clusterFilter). 
So he is more interested in better, closely grouped clusters and less 
interested in isolated points. Clustering remote points in this case will not 
help the user. (This is the exact problem I was stuck on while developing my 
application, and the reason I created this patch.)

And the closestCanopy is not always that "close"; it depends on the data. 
Sometimes a point that is really far away is assigned to a canopy that 
otherwise contains good-quality, closely grouped points, just because it was 
the closest canopy to that isolated point. This simply destroys the quality of 
the cluster.
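
To illustrate the kind of check I mean, here is a minimal, self-contained 
sketch. It is not the actual Mahout code or the attached patch: the class and 
method names are hypothetical, plain Euclidean distance stands in for whatever 
distance measure is configured, and treating "within t1" as "distance <= t1" is 
my assumption. A point is assigned to its closest canopy only if that canopy's 
center lies within t1; otherwise it is left unassigned.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Mahout's API.
final class StrictCanopySketch {

    // Euclidean distance, used here as a stand-in for the configured measure.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the closest canopy center that lies within t1 of
    // the point, or -1 if no center is within t1 (the point stays unassigned).
    static int closestCanopyWithinT1(double[] point, List<double[]> centers, double t1) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(point, centers.get(i));
            // Assumption: "within t1" means d <= t1.
            if (d <= t1 && d < bestDistance) {
                best = i;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> centers =
            Arrays.asList(new double[] {0.0, 0.0}, new double[] {10.0, 10.0});
        double t1 = 3.0;
        // A nearby point is assigned to the first canopy (index 0).
        System.out.println(closestCanopyWithinT1(new double[] {1.0, 1.0}, centers, t1));
        // An isolated point is outside t1 of every center, so the result is -1.
        System.out.println(closestCanopyWithinT1(new double[] {50.0, 50.0}, centers, t1));
    }
}
```

Without the t1 check, the isolated point (50, 50) would be pulled into the 
canopy centered at (10, 10) simply because it is the nearest one.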

And the user gets this bad-quality cluster because "every clustering algorithm 
does it", which, in my view, is not helping the user. Canopy Clustering is not 
even a clustering algorithm; it's just a means to find the approximate number 
of clusters of size T1/T2, which can then be used in K-means (all of which is 
already done before the clusterData phase, so this change won't affect the 
buildCluster phase). Canopy Clustering is just a utility which helps the user 
get clusters easily and with high performance.

I hope I have answered all your queries. 
                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it returns 
> a canopy which is within distance t1 of the point. This produces incorrect 
> results, i.e. points outside t1 are grouped into canopies.


        
