Re: syntheticcontroldata clustering example failure due to combiner

Jeff Eastman Wed, 10 Jun 2009 16:30:59 -0700

Synthetic Control actually used to work with all the clustering jobs. The move to Hadoop 0.19 introduced intermittent problems that depend upon optimizations done behind the scenes in Hadoop. All of the original implementations used combiners under the assumption that they would only run after the mapper and they would run exactly once. These assumptions changed in 0.19. M-99 fixed K-Means but not Canopy or Mean Shift, which still have these assumptions.
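To illustrate why the "runs exactly once" assumption matters, here is a minimal, self-contained sketch. It is not Mahout's combiner code; the class and method names are hypothetical, and it uses plain averages rather than real centroid vectors. It contrasts a combiner that emits a finished mean (wrong if applied more than once) with one that carries a sum and a count (safe however many times it is applied).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from Mahout or Hadoop.
class CombinerSemantics {

    // Unsafe: collapses its input to a mean and discards the point count.
    // Feeding its output back in as if it were a single point skews the result.
    static double meanCombiner(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Safe: every record is {sum, count}. Addition is associative, so this
    // can run zero, one, or many times without changing the final mean.
    static double[] sumCountCombiner(List<double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] {sum, count};
    }

    // Wraps a raw value as a {sum, count} record for a single point.
    static double[] point(double v) {
        return new double[] {v, 1};
    }

    // Points 1, 2, 3, 6 have a true mean of 3.0.
    static double unsafeTwoPass() {
        double first = meanCombiner(Arrays.asList(1.0, 2.0, 3.0)); // first pass
        return meanCombiner(Arrays.asList(first, 6.0));            // second pass
    }

    static double safeTwoPass() {
        double[] first = sumCountCombiner(Arrays.asList(point(1), point(2), point(3)));
        double[] second = sumCountCombiner(Arrays.asList(first, point(6)));
        return second[0] / second[1];
    }
}
```

With these inputs the two-pass mean combiner no longer yields the true mean of 3.0, while the sum-and-count version still does.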

Unfortunately, the combiner seems to run only once and only with the mappers in the development mode which is used by the build and all the unit tests. This caused the severity of the semantics change to remain undetected until recently, when users started trying to run clustering on real Hadoop clusters.

The only solution I can imagine right now is to move the combiner centroid summation code back into the mappers and have the mappers output fully combined data during close(). It is not very elegant; perhaps someone has a better solution in mind. I will take a look at it tonight after the Hadoop Summit.
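A minimal sketch of that idea, with no Hadoop dependency: the mapper accumulates per-cluster sums in memory and emits one combined record per cluster when it is closed. The class name, the integer cluster ids, and the record layout (vector sum followed by point count) are assumptions made for illustration, not Mahout's actual types.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and record layout are illustrative only.
class InMapperCentroidSummer {
    // clusterId -> running component-wise sum of the points assigned to it
    private final Map<Integer, double[]> sums = new HashMap<>();
    // clusterId -> number of points seen for that cluster
    private final Map<Integer, Integer> counts = new HashMap<>();

    // Plays the role of map(): accumulate locally instead of emitting.
    void map(int clusterId, double[] point) {
        double[] sum = sums.computeIfAbsent(clusterId, k -> new double[point.length]);
        for (int i = 0; i < point.length; i++) {
            sum[i] += point[i];
        }
        counts.merge(clusterId, 1, Integer::sum);
    }

    // Plays the role of close(): emit one fully combined record per cluster.
    // Each record is the vector sum followed by the point count.
    Map<Integer, double[]> close() {
        Map<Integer, double[]> out = new HashMap<>();
        for (Map.Entry<Integer, double[]> e : sums.entrySet()) {
            double[] sum = e.getValue();
            double[] record = new double[sum.length + 1];
            System.arraycopy(sum, 0, record, 0, sum.length);
            record[sum.length] = counts.get(e.getKey());
            out.put(e.getKey(), record);
        }
        return out;
    }
}
```

Because nothing is emitted until close(), the result no longer depends on whether or how often the framework runs a combiner. The cost is that each mapper holds one accumulator per cluster in memory. In a real job against the old org.apache.hadoop.mapred API, close() is not handed an output collector, so the mapper would presumably need to keep a reference to the one passed to map().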


Jeff

Adil Aijaz wrote:

Hi folks,
I am new to mahout and I started exploring the mahout 0.1 release by trying to run the kmeans clustering example as described in http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
After a bunch of runs where, no matter what parameters I specified, the output never changed, I realized that:
1. KMeans was clustering all 600 points of syntheticcontroldata into one cluster.
2. There is a bug in examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java that called runJob from the main function with my provided arguments transposed. So, my convergenceDelta was interpreted as t1, t1 as t2, and t2 as convergenceDelta. I will commit a patch as soon as I get approval for opensource commits from my employer, however, I thought I'd put it out there in case someone else is going through the same issue.
As for the more serious issue #1 (kmeans clustering everything into one cluster), I found that this is because the CanopyClusteringJob was generating only one canopy. Digging deeper, I found that this problem was coming from the CanopyCombiner being run in both map & reduce phases. From there I discovered this post from December 2008:
http://tinyurl.com/l83ff4
which indicates that from hadoop 0.18 onwards the combiner will be run in both map and reduce, which is bad since the CanopyCombiner and KMeansCombiner assume that they are executed only on the map side. Now, the suggested workaround is specific to hadoop 0.18 and it doesn't work with mahout-0.1 since it requires hadoop 0.19. This means a code fix is needed for this issue. From the thread, Grant talks about a patch (MAHOUT-99) that fixes the code, but that patch is already part of mahout-0.1 and so it apparently does not fix the issue.
All that to say, I haven't been able to get the kmeans clustering example on syntheticdata to work, which is a bummer. My questions are:
1) Are there any open jiras on this issue (I didn't find any)? If not, should I create one?
2) Any workarounds for now?


Adil
