[jira] [Issue Comment Edited] (MAHOUT-825) Canopies grouping records outside t1

Paritosh Ranjan (Issue Comment Edited) (JIRA) Thu, 06 Oct 2011 22:34:59 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122542#comment-13122542
 ]


Paritosh Ranjan edited comment on MAHOUT-825 at 10/7/11 5:33 AM:
-----------------------------------------------------------------

I experimented with the distance calculation, and I am now using the canopy 
radius instead of the t1 parameter.

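private boolean shouldCluster(Canopy canopy, Vector point) {
  // Without the strict flag, every point is accepted.
  if (!clusterStrictly) {
    return true;
  }
  Vector currentCenter = canopy.getCenter();
  double distance = measure.distance(currentCenter.getLengthSquared(),
      currentCenter, point);
  // Squared length of the canopy's radius vector.
  double radius = canopy.getRadius().getLengthSquared();
  // Accept the point only if it lies within three times that value.
  return distance < radius * 3;
}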

The positives and negatives of this approach are:

+ve: It does not depend on t1. Radius is, I think, a better way to measure 
distance from canopies (according to the discussion above). In my experiments 
the results were also "much better" than with t1: some meaningful points that 
were missed when using t1 are now being clustered.

-ve: I now have no control over the quality of the cluster, because the 
multiplier 3 is a constant. With t1, I could at least control the quality of 
the cluster.

I think that

return distance < (radius*(1.5*t1))

will at least give the user some control over the quality of the output.
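
To make the t1-scaled check concrete, here is a minimal, self-contained sketch. 
It is an illustration only, not the patch itself: plain double arrays stand in 
for Mahout's Vector, Euclidean distance stands in for whatever measure is 
configured, and the names StrictCanopyCheck, lengthSquared and euclidean are 
hypothetical. The factor 1.5 is the one suggested above.

```java
final class StrictCanopyCheck {

    // Squared Euclidean length of a vector (assumed stand-in for
    // Vector.getLengthSquared()).
    static double lengthSquared(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return sum;
    }

    // Euclidean distance (assumed stand-in for the configured measure).
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mirrors the proposed check: distance < radius * (1.5 * t1).
    static boolean shouldCluster(double[] center, double[] radiusVector,
                                 double[] point, double t1,
                                 boolean clusterStrictly) {
        if (!clusterStrictly) {
            return true; // non-strict mode accepts every point
        }
        double distance = euclidean(center, point);
        double radius = lengthSquared(radiusVector);
        return distance < radius * (1.5 * t1);
    }
}
```

With this shape, raising t1 loosens the check and lowering it tightens the 
check, so the user keeps a single knob for cluster quality while the test 
itself is still based on the radius.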

I am against implementing this as a post-processing step, because the extra 
step in computing clusters degrades performance, and canopy is supposed to be 
fast.

Thanks for the suggestions to improve it. I hope this implementation is better 
than the previous one. 

                
                  
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: Clustering Remote Points - Two Big, Useless 
> Clusters.txt, Not Clustering Remote Points - Two Meaningful Clusters.txt, 
> canopy-clusterFilter-t1, canopy-outside-t1-points-patch-1, 
> canopy-strict-clustering-flag
>
>
> While finding the closest canopy, there is no check to ensure that the 
> returned canopy is within distance t1 of the point. This produces incorrect 
> results: points outside t1 are grouped into canopies.
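
The missing check described in the issue summary can be sketched as follows. 
This is a hedged illustration rather than Mahout code: canopy centers are plain 
double arrays, Euclidean distance is assumed, and the ClosestCanopy class with 
its findClosestWithinT1 method is hypothetical.

```java
import java.util.List;

final class ClosestCanopy {

    // Euclidean distance (an assumption; any distance measure could be used).
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the nearest center, or -1 when even the nearest
    // center is farther than t1 from the point.
    static int findClosestWithinT1(List<double[]> centers, double[] point,
                                   double t1) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = euclidean(centers.get(i), point);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return bestDistance <= t1 ? best : -1;
    }
}
```

Returning -1 leaves the caller free to decide what to do with a point that no 
canopy covers, instead of silently assigning it to a distant canopy.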

