[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598990#comment-13598990
 ] 

Matt Molek commented on MAHOUT-1103:
------------------------------------

I've implemented an idea for this that works regardless of what the cluster ids 
are. It's in a separate class right now, but if you like the idea, I can 
refactor it into a patch against the current clusterpp code.

I made a class similar to 
o.a.m.clustering.topdown.postprocessor.ClusterCountReader, called 
ClusterIDReader. Given a clustering output directory, it reads all the cluster 
ids and returns them as a list.
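In rough outline, the id-collection step looks like this (the SequenceFile iteration over the part files is elided, and the class and method names here are just illustrative, not the actual ones):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch of the ClusterIDReader idea: collect the distinct
// cluster ids and return them in a deterministic (sorted) order, so that
// mappers and reducers can independently derive the same id -> index mapping.
public class ClusterIdCollector {

  // In the real class the raw ids come from iterating the SequenceFiles
  // under the clustering output directory; here they are just an Iterable.
  public static List<String> collectClusterIds(Iterable<String> rawIds) {
    TreeSet<String> distinct = new TreeSet<>(); // sorted and de-duplicated
    for (String id : rawIds) {
      distinct.add(id);
    }
    return new ArrayList<>(distinct);
  }
}
```

The sorted order matters: as long as every task sees the same clustering output, every task derives the same list, and therefore the same mapping.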

To actually process the data, I run a MapReduce job over the clusteredPoints 
output directory. In the mapper's setup function, the list of cluster ids is 
obtained from ClusterIDReader, and a HashMap is constructed mapping each 
cluster id to an int from 0 to k-1. The mapper then emits each clustered 
vector under its new int key.
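The mapping construction itself is trivial; a simplified sketch (the name `buildIndex` is illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the mapper-side setup: turn the ordered list of cluster ids
// into a HashMap from cluster id to an int in [0, k). The mapper emits
// each clustered vector under this int key instead of the raw cluster id.
public class ClusterIdIndexer {

  public static Map<String, Integer> buildIndex(List<String> clusterIds) {
    Map<String, Integer> index = new HashMap<>();
    for (int i = 0; i < clusterIds.size(); i++) {
      index.put(clusterIds.get(i), i); // position in the list = new key
    }
    return index;
  }
}
```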

There are k reducers, one for each cluster, and each reducer uses 
ClusterIDReader to reverse the clusterID mapping that was created by the 
mappers. This allows the original cluster ids to be preserved at the end of the 
job. Once the job is done, the movePartFilesToRespectiveDirectories method 
works as before to move the part files to correctly named directories. 

Because the intermediate keys are guaranteed to be an unbroken sequence of ints 
from 0 to k-1, I think the hash partitioner will always send the vectors from 
each cluster to exactly one reducer (assuming there are k reducers).
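To illustrate why: the default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReducers, and IntWritable.hashCode() is just the int value itself, so keys 0 to k-1 map one-to-one onto reducers 0 to k-1. A quick demonstration, also showing how the two Text hash codes reported in this issue collide into the same partition when k = 2:

```java
// Hadoop's default HashPartitioner assigns:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReducers
public class PartitionDemo {

  public static int partition(int hashCode, int numReducers) {
    return (hashCode & Integer.MAX_VALUE) % numReducers;
  }

  public static void main(String[] args) {
    // With IntWritable keys 0..k-1 (hashCode() == the int itself),
    // key i always lands on reducer i: a perfect 1-to-1 assignment.
    int k = 5;
    for (int i = 0; i < k; i++) {
      System.out.println("key " + i + " -> reducer " + partition(i, k)); // reducer i
    }
    // The Text hash codes reported in this issue both land on reducer 0
    // when k = 2, which is why one reducer received both clusters and
    // the other produced an empty part file.
    System.out.println(partition(-685560454, 2)); // hash of "VL-3742464"
    System.out.println(partition(-685560452, 2)); // hash of "VL-3742466"
  }
}
```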

Would you like a version of this as a patch?

 
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
