Is Mahout compatible with hadoop-0.18 and later versions?
From hadoop 0.18 onwards, the combiner execution policy has changed: the
combiner can now be executed twice, first on the Mapper side (on the output of
the Mapper) and then again on the Reducer side (on the output of the first
Combiner).
For more details: http://issues.apache.org/jira/browse/HADOOP-3226
It seems to me that the k-means and canopy clustering in Mahout assume that the
combiner is executed on the Mapper side only. This is a major source of error:
when the Combiner is executed on the Reducer side, it cannot parse the output
of the first Combiner correctly.
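To illustrate what a combiner needs in order to survive being run more than once, here is a minimal sketch in plain Java (no Hadoop classes, and not Mahout's actual code). All names in it (RepeatableCombinerSketch, Partial, fromPoint, combine, centroid) are hypothetical. The idea is that the combiner's output has the same form as its input, a (count, sum) pair, so it can safely be applied to its own output:

```java
import java.util.Arrays;
import java.util.List;

public class RepeatableCombinerSketch {

    /** Partial result: how many points were seen and their component-wise sum. */
    static final class Partial {
        final long count;
        final double[] sum;

        Partial(long count, double[] sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    /** What a mapper would emit for a single point: a partial with count 1. */
    static Partial fromPoint(double[] point) {
        return new Partial(1, point.clone());
    }

    /**
     * Combiner step. Input and output are both Partial, so the result can be
     * fed back into combine() again. Assumes a non-empty list of partials
     * that all have the same dimension.
     */
    static Partial combine(List<Partial> values) {
        long count = 0;
        double[] sum = new double[values.get(0).sum.length];
        for (Partial p : values) {
            count += p.count;
            for (int i = 0; i < sum.length; i++) {
                sum[i] += p.sum[i];
            }
        }
        return new Partial(count, sum);
    }

    /** Reducer step: only here is the partial turned into a centroid. */
    static double[] centroid(Partial p) {
        double[] c = new double[p.sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = p.sum[i] / p.count;
        }
        return c;
    }

    public static void main(String[] args) {
        Partial a = fromPoint(new double[] {1, 2});
        Partial b = fromPoint(new double[] {3, 4});
        Partial c = fromPoint(new double[] {5, 6});

        // Combiner applied once, as in older hadoop versions.
        Partial once = combine(Arrays.asList(a, b, c));

        // Combiner applied on the Mapper side, then again on the Reducer side.
        Partial mapSide1 = combine(Arrays.asList(a, b));
        Partial mapSide2 = combine(Arrays.asList(c));
        Partial twice = combine(Arrays.asList(mapSide1, mapSide2));

        // Both lines print [3.0, 4.0]
        System.out.println(Arrays.toString(centroid(once)));
        System.out.println(Arrays.toString(centroid(twice)));
    }
}
```

If the combiner instead emitted something its own input parser does not understand (for example an already averaged centroid without the count), the second pass would fail or give wrong results, which is the kind of problem described above.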
As a workaround, for hadoop-0.18.* only, if you want the combiner to run only
on the output of the mapper (as in earlier hadoop versions), add the following
to your job config:

    job.setCombineOnlyOnce(true);

This method (setCombineOnlyOnce) is not available in the hadoop-0.19 release,
so I think the Mahout code needs to be changed to handle this issue.
Pradhuman