[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121955#comment-13121955
 ] 

Paritosh Ranjan edited comment on MAHOUT-825 at 10/6/11 1:54 PM:
-----------------------------------------------------------------

I have attached two versions of the clustered points ( clusters ). This is real 
data which my application was clustering. Opening these files in Notepad++, 
EditPlus, etc. will show the results in a more readable format. I have also 
printed the Vectors, but you can just look at the String on each line. The 
intended behavior is that similar-looking Strings should be clustered together.

One file ( Clustering Remote Points - Two Big, Useless Clusters ) contains two 
big clusters ( scroll down to see the second cluster ). This file was created 
using the cluster-all-points approach.

The second file ( Not Clustering Remote Points - Two Meaningful Clusters ) 
contains points clustered only within t1.

I hope you will agree that the second file with two small clusters makes sense 
in this case.

Now I have used a flag to cluster data strictly (instead of clusterFilter). I 
have created and attached a patch ( canopy-strict-clustering-flag ). It also 
has test cases which demonstrate how remote points can be added to or 
rejected from a canopy (cluster).
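
To illustrate the idea, here is a minimal sketch in plain Java. It is not the 
code from the attached patch and does not use the Mahout API: the class name, 
the method names, the use of Euclidean distance, and the convention of 
returning -1 for an unassigned point are all assumptions made for this 
example. It only shows how a strict flag could decide whether a point that 
lies farther than t1 from its closest canopy is still assigned to it.

```java
import java.util.List;

// Hypothetical sketch, not the Mahout implementation or the attached patch.
public class StrictCanopySketch {

    /** Euclidean distance between two points of equal dimension (assumed measure). */
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the closest canopy center.
     * When strict is true and the closest center is farther than t1,
     * returns -1, meaning the point is left unclustered.
     * When strict is false, the closest center is always returned,
     * however far away it is (the cluster-all-points behavior).
     */
    public static int findClosestCanopy(double[] point, List<double[]> centers,
                                        double t1, boolean strict) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(point, centers.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        // Assumption: a point exactly at distance t1 still counts as inside.
        if (strict && bestDistance > t1) {
            return -1;
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> centers = List.of(new double[] {0, 0}, new double[] {10, 10});
        double[] remote = {100, 100};
        double t1 = 3.0;
        // Non-strict: the remote point is still attached to the nearest canopy.
        System.out.println(findClosestCanopy(remote, centers, t1, false));
        // Strict: the remote point is outside t1 of every canopy and is rejected.
        System.out.println(findClosestCanopy(remote, centers, t1, true));
    }
}
```

With strict set to false, the remote point ends up in the nearest canopy, 
which is how the two big clusters in the first attached file arise. With 
strict set to true, it is left out, as in the second file.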

The cluster files that I have attached are the exact use case for which I 
fixed this issue in Mahout. I encountered this problem and had no idea what 
was happening inside. I kept checking my vector creation and distance 
measures, but the problem was not there. So I think this patch can help 
others. Also, the flag defaults to false, so users can enable it as 
required.
                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: Clustering Remote Points - Two Big, Useless 
> Clusters.txt, Not Clustering Remote Points - Two Meaningful Clusters.txt, 
> canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1, 
> canopy-strict-clustering-flag
>
>
> While finding the closest canopy, there is no check to ensure that the 
> returned canopy is within distance t1 of the point. As a result, points 
> outside t1 are incorrectly grouped into canopies.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
