[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481544#comment-13481544 ]
Paritosh Ranjan commented on MAHOUT-1103: ----------------------------------------- Instead of changing the nomenclature of clusters, we should create a partitioner which solves the problem ( if the problem is in partitioner ). By changing the nomenclature of the clusters, we will be fixing the bug just for this particular case. In case, someone creates as many as 3742464 clusters ( assume ), or more, then again the same problem will rise. So, we should fix it at a generic level. Looking forward for your tests without SVD. As I said earlier, also try to use a better Partitioner. The partitioner can be changed in postProcessMR method of ClusterOutputPostProcessorDriver class. You can read this page https://cwiki.apache.org/MAHOUT/how-to-contribute.html to see how to get code and generate patches. > clusterpp is not writing directories for all clusters > ----------------------------------------------------- > > Key: MAHOUT-1103 > URL: https://issues.apache.org/jira/browse/MAHOUT-1103 > Project: Mahout > Issue Type: Bug > Components: Clustering > Affects Versions: 0.8 > Reporter: Matt Molek > Assignee: Paritosh Ranjan > Labels: clusterpp > > After running kmeans clustering on a set of ~3M points, clusterpp fails to > populate directories for some clusters, no matter what k is. > I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2 > Even with k=2 only one cluster directory was created. For each reducer that > fails to produce directories there is an empty part-r-* file in the output > directory. > Here is my command sequence for the k=2 run: > {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o > 2clusters/pca-clusters -dm > org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 > -cl > bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o > 2clusters.txt > bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} > The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 > containing 2585843 and 1156624 points respectively. > Discussion on the user mailing list suggested that this might be caused by > the default hadoop hash partitioner. The hashes of these two clusters aren't > identical, but they are close. Putting both cluster names into a Text and > caling hashCode() gives: > VL-3742464 -> -685560454 > VL-3742466 -> -685560452 > Finally, when running with "-xm sequential", everything performs as expected. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira