mahout & hadoop compatibility

Pradhuman Jhala Thu, 04 Dec 2008 17:12:50 -0800

 
Just wondering whether Mahout is compatible with hadoop-0.18 and later versions.
From hadoop 0.18 onwards, the combiner execution policy has changed: the
combiner can now be executed twice, first on the Mapper side (on the output of
the Mapper) and then again on the Reducer side (on the output of the first
Combiner).
 
For more details: http://issues.apache.org/jira/browse/HADOOP-3226 
 
It seems to me that the k-means and canopy clustering in Mahout assume that the
combiner is executed on the Mapper side only. This is a major source of errors:
when the Combiner is executed on the Reducer side, it cannot parse the output
of the first Combiner correctly.
 
As a workaround for hadoop-0.18.* only, if you want the combiner to run only on
the output of the mapper (as in earlier hadoop versions), add the following to
your job config:
 
job.setCombineOnlyOnce(true); 
  
This method (setCombineOnlyOnce) is not available in the hadoop-0.19 release, so
I think the Mahout code needs to be changed to handle this issue.
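
One possible direction for such a change, sketched in plain Java with no Hadoop
dependency: if the combiner's output has the same shape as its input, it can
safely run zero, one, or several times. The class PartialSum and its methods
are hypothetical names used only for illustration; they are not Mahout's
actual classes, and this is not necessarily how Mahout should implement it.

```java
import java.util.List;

/** Hypothetical partial aggregate for one cluster: a vector sum plus a point count. */
final class PartialSum {
    final double[] sum;
    final long count;

    PartialSum(double[] sum, long count) {
        this.sum = sum.clone();
        this.count = count;
    }

    /** Mapper side: a single point is emitted as a partial sum with count 1. */
    static PartialSum ofPoint(double[] point) {
        return new PartialSum(point, 1L);
    }

    /**
     * Combiner step: merges partial sums into one partial sum.
     * Input and output have the same type, so the result can be fed
     * back into combine() again without any special parsing.
     */
    static PartialSum combine(List<PartialSum> parts) {
        double[] total = new double[parts.get(0).sum.length];
        long n = 0;
        for (PartialSum p : parts) {
            for (int i = 0; i < total.length; i++) {
                total[i] += p.sum[i];
            }
            n += p.count;
        }
        return new PartialSum(total, n);
    }

    /** Reducer step: only here is the accumulated sum divided by the count. */
    static double[] centroid(List<PartialSum> parts) {
        PartialSum all = combine(parts);
        double[] c = new double[all.sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = all.sum[i] / all.count;
        }
        return c;
    }
}
```

With this shape, calling PartialSum.centroid on the raw mapper outputs, on
values combined once, or on values combined twice gives the same centroid, so
the job would not depend on where or how often the combiner runs.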
 
Pradhuman
