Canopy is a single-pass algorithm, so producing multiple reducer output files 
would be a problem with that approach. With Canopy you need to adjust T1 and T2 
(and T3 and T4 as well) so that each mapper produces a manageable number of 
clusters. The reducer phase then just merges mapper outputs that are close 
together.
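
For illustration, here is a minimal Java sketch of that two-phase idea. It is 
not Mahout's CanopyDriver code: CanopySketch, canopyCenters and 
mergeMapperOutputs are hypothetical names, only canopy centers are tracked (so 
T1, which governs loose membership, is left out), and it assumes T4 plays the 
role of T2 in the reducer.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and structure are assumptions, not Mahout's API. */
class CanopySketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Mapper-side single pass: a point starts a new canopy unless it lies
     * within t2 of an existing canopy center. A larger t2 yields fewer canopies.
     */
    static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean covered = false;
            for (double[] c : centers) {
                if (distance(p, c) < t2) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers;
    }

    /**
     * Reducer-side merge: pool the centers emitted by every mapper and run the
     * same pass again with t4, so centers that are close together collapse.
     */
    static List<double[]> mergeMapperOutputs(List<List<double[]>> mapperCenters, double t4) {
        List<double[]> all = new ArrayList<>();
        for (List<double[]> centers : mapperCenters) {
            all.addAll(centers);
        }
        return canopyCenters(all, t4);
    }
}
```

Because mergeMapperOutputs needs to see every mapper's centers in one place, 
the sketch also shows why a single reducer is the natural fit for this step.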

Mean Shift is pretty much an iterative Canopy implementation, hence its name: 
MeanShiftCanopy. If the number of reducers is greater than the number of 
iterations (the patch does not enforce this), then it will produce a single 
cluster file when it completes. Any other combination would terminate with 
multiple output files that might contain mergeable overlaps.

-----Original Message------
From: Elmer Garduno (JIRA) [mailto:[email protected]] 
Sent: Wednesday, June 29, 2011 12:30 PM
To: [email protected]
Subject: [jira] [Commented] (MAHOUT-749) MeanShift Cannot Use Multiple Reducers


    [ 
https://issues.apache.org/jira/browse/MAHOUT-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057416#comment-13057416
 ] 

Elmer Garduno commented on MAHOUT-749:
--------------------------------------

Hi Jeff, 

The CanopyDriver has the same problem: it also sets numReducers=1. Do you 
think this kind of solution could also fix Canopy's scalability issues?

> MeanShift Cannot Use Multiple Reducers
> --------------------------------------
>
>                 Key: MAHOUT-749
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-749
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-749.patch
>
>
> The MeanShiftCanopy clustering job sets numReducers=1, which severely 
> limits its scalability for larger jobs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
