[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120666#comment-13120666
 ] 

Paritosh Ranjan commented on MAHOUT-825:
----------------------------------------

As Sean mentioned, 

"This canopy-generation business is a heuristic speed-up. I would not expect it 
to mean some things don't get clustered."

These two goals (a fast heuristic, and every point getting clustered) cannot 
both be achieved in canopy generation. Clustering every point makes the canopy 
generation process very slow (as all canopies are processed on a "single" 
reducer in the buildCluster phase).

To get rid of this performance problem, the number of canopies was controlled 
(based on their quality) using a variable, clusterFilter (introduced a week 
ago). This is an int variable that prevents the formation of canopies having 
clusterFilter or fewer points, which drastically improves the performance of 
the reducer phase of buildCluster, as fewer canopies are generated. In a 
sense, this variable (clusterFilter) tells whether the user wants every point 
to be inside a cluster or not. (See code snippet.)

    if (canopy.getNumPoints() > clusterFilter) {
      context.write(new Text(canopy.getIdentifier()), canopy);
    }

Setting this variable clusterFilter > 0 implies that the user is only 
interested in clusters which have more than clusterFilter points, and is not 
interested in remote, single, or few isolated points. This also means that the 
user does not want every point to be clustered.
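
To make the effect of clusterFilter concrete, here is a minimal sketch of the 
same comparison outside Hadoop. The Canopy class, the emittedIds helper and 
the sample point counts are hypothetical stand-ins for illustration, not 
Mahout code; only the "> clusterFilter" comparison is taken from the snippet 
above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ClusterFilterSketch {

    // Hypothetical stand-in for a canopy: just an identifier and a point count.
    static final class Canopy {
        final String id;
        final int numPoints;

        Canopy(String id, int numPoints) {
            this.id = id;
            this.numPoints = numPoints;
        }
    }

    // Returns the identifiers of canopies that would be written out, i.e. those
    // with strictly more than clusterFilter points (same comparison as above).
    static List<String> emittedIds(List<Canopy> canopies, int clusterFilter) {
        List<String> ids = new ArrayList<String>();
        for (Canopy canopy : canopies) {
            if (canopy.numPoints > clusterFilter) {
                ids.add(canopy.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up point counts, chosen only to show both sides of the filter.
        List<Canopy> canopies = Arrays.asList(
            new Canopy("C-0", 1),
            new Canopy("C-1", 3),
            new Canopy("C-2", 40));

        // clusterFilter = 0: every non-empty canopy is kept.
        System.out.println(emittedIds(canopies, 0));
        // clusterFilter = 3: canopies with 3 or fewer points are dropped.
        System.out.println(emittedIds(canopies, 3));
    }
}
```

With clusterFilter = 0 nothing non-empty is dropped, so the filter only 
changes the output once the user sets it above zero.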

So, we can have an additional check of (clusterFilter > 0) in this patch 
which, if true, implies that the user is not interested in clustering every 
point and is more interested in the quality of the clusters. Then this patch 
will give the user the desired result.
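
The additional check proposed above could look roughly like the sketch below. 
This is not the attached patch: the closestCanopy method, its -1 return value 
for an unclustered point, the Euclidean distance, and treating a distance of 
exactly t1 as "within t1" are all assumptions made for illustration.

```java
final class ClosestCanopySketch {

    // Returns the index of the closest canopy center, or -1 if the point is
    // left unclustered. The t1 bound is enforced only when clusterFilter > 0,
    // i.e. when the user has opted out of clustering every point.
    static int closestCanopy(double[] point, double[][] centers,
                             double t1, int clusterFilter) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double distance = euclidean(point, centers[i]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        // Assumed form of the proposed check: drop the point if it lies
        // outside t1 of its nearest canopy and filtering is switched on.
        if (clusterFilter > 0 && bestDistance > t1) {
            return -1;
        }
        return best;
    }

    // Plain Euclidean distance, used here only to keep the sketch runnable.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] centers = { {0.0, 0.0}, {10.0, 10.0} };
        double[] farPoint = {5.0, 4.0}; // further than t1 = 3 from both centers

        // clusterFilter = 0: the point is still assigned to its nearest canopy.
        System.out.println(closestCanopy(farPoint, centers, 3.0, 0));
        // clusterFilter = 2: the point is outside t1 and is left unclustered.
        System.out.println(closestCanopy(farPoint, centers, 3.0, 2));
    }
}
```

The design choice here is that clusterFilter = 0 keeps the old behaviour 
(every point lands in some canopy), while any positive value trades coverage 
for canopies that respect t1.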



                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: canopy-outside-t1-points-patch-1
>
>
> While finding the closest canopy, there is no check to ensure that it 
> returns canopies which are within distance t1 of the point. This leads to 
> an incorrect result, i.e. points outside t1 are grouped in canopies.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
