Canopy is a single-pass algorithm, so producing multiple reducer output files would be a problem with that approach. With Canopy you need to adjust T1 and T2 (and T3 and T4 too) so that each mapper produces a manageable number of clusters. The reducer phase just ensures that mapper outputs which are close together get merged.
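To make the mapper/reducer split concrete, here is a rough Python sketch of the idea. It is not the Mahout code: the function names, the Euclidean distance, and the sample points are all assumptions for illustration. Each "mapper" runs one canopy pass over its split with T1/T2, and a single "reducer" runs the same pass over the mapper centroids with T3/T4 (defaulting to T1/T2 here).

```python
import math


def canopy_centers(points, t1, t2):
    """One canopy pass over `points`; returns the canopy centroids.

    Illustrative sketch only (hypothetical helper, Euclidean distance).
    Expects t1 > t2.
    """
    canopies = []  # each canopy: fixed center, running sum, point count
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = math.dist(p, c["center"])
            if d < t1:
                # Loosely bound: the point contributes to this canopy's centroid.
                c["sum"] = [s + x for s, x in zip(c["sum"], p)]
                c["n"] += 1
            if d < t2:
                # Strongly bound: the point must not seed a new canopy.
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": tuple(p), "sum": list(p), "n": 1})
    return [tuple(s / c["n"] for s in c["sum"]) for c in canopies]


def map_reduce_canopy(splits, t1, t2, t3=None, t4=None):
    """Simulate the mapper phase per split, then a single reducer pass."""
    t3 = t1 if t3 is None else t3
    t4 = t2 if t4 is None else t4
    # Mapper phase: each split independently emits its canopy centroids.
    mapper_out = [c for split in splits for c in canopy_centers(split, t1, t2)]
    # Reducer phase: centroids from different mappers that are close get merged.
    return canopy_centers(mapper_out, t3, t4)


# Made-up sample data: two splits that each see both dense regions.
splits = [
    [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)],
    [(0.2, 0.1), (10.5, 10.0)],
]
print(map_reduce_canopy(splits, t1=3.0, t2=1.5))
```

With these made-up splits each mapper emits two centroids, and the reducer pass folds the four of them into two. Shrinking T1/T2 makes each mapper emit more canopies, which is the knob to turn when the reducer input gets unmanageable.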
Mean Shift is essentially an iterative Canopy implementation, hence its name: MeanShiftCanopy. If the number of reducers is greater than the number of iterations (the patch does not enforce this), then it will produce a single cluster file when it completes. Other values would terminate with multiple output files that might contain mergeable overlaps.

-----Original Message-----
From: Elmer Garduno (JIRA) [mailto:[email protected]]
Sent: Wednesday, June 29, 2011 12:30 PM
To: [email protected]
Subject: [jira] [Commented] (MAHOUT-749) MeanShift Cannot Use Multiple Reducers

[ https://issues.apache.org/jira/browse/MAHOUT-749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057416#comment-13057416 ]

Elmer Garduno commented on MAHOUT-749:
--------------------------------------

Hi Jeff,

The CanopyDriver has the same problem: it also sets numReducers=1. Do you think that this kind of solution could also fix Canopy's scalability issues?

> MeanShift Cannot Use Multiple Reducers
> --------------------------------------
>
> Key: MAHOUT-749
> URL: https://issues.apache.org/jira/browse/MAHOUT-749
> Project: Mahout
> Issue Type: Improvement
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Attachments: MAHOUT-749.patch
>
>
> The MeanShiftCanopy clustering job sets the numReducers=1 and this severely
> limits its scalability for larger jobs.
