[
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673071#comment-13673071
]
Grant Ingersoll commented on MAHOUT-1103:
-----------------------------------------
Matt, out of curiosity, what's your use case for the clusterpp? [~robinanil]
and I are both looking at this code and wondering why it is useful to separate
out the clusters into their own directory. MAHOUT-843 doesn't shed any light
on it for us either.
Also, I don't think the current patch partitions correctly. For instance, try
a numPartitions of 2 and cluster ids of 153 and 53. With a modulus of 10^1,
153 % 10 and 53 % 10 both equal 3, so the two ids collide. So I think I'm
back to my original thought: in the mappers and reducers, we need to load up
the cluster ids and map them there.
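
The collision above, and one way the "load up the cluster ids and map them"
idea could look, can be sketched in plain Java. This is only an illustration
under assumptions: the class and method names are hypothetical, the modulo
scheme is taken from the description in this comment rather than from the
patch itself, and the dense-index mapping is one possible reading of the
proposal, not the committed fix.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Mahout.
final class ClusterIdPartitionSketch {

  // Scheme as described in the comment: keep only the low decimal digits
  // of the cluster id by reducing it modulo a power of ten.
  static int moduloPartition(int clusterId, int modulus) {
    return clusterId % modulus;
  }

  // Assumed alternative: load all known cluster ids up front and give each
  // distinct id its own dense index (0, 1, 2, ...) in sorted order.
  static Map<Integer, Integer> indexClusterIds(int[] clusterIds) {
    int[] sorted = clusterIds.clone();
    Arrays.sort(sorted);
    Map<Integer, Integer> index = new HashMap<>();
    for (int id : sorted) {
      index.putIfAbsent(id, index.size());
    }
    return index;
  }

  // Partition chosen from the dense index instead of the raw id.
  static int indexedPartition(Map<Integer, Integer> index, int clusterId,
                              int numPartitions) {
    return index.get(clusterId) % numPartitions;
  }

  public static void main(String[] args) {
    int[] ids = {153, 53};

    // Both ids end in 3, so the modulo scheme sends them to the same place.
    System.out.println(moduloPartition(153, 10)); // 3
    System.out.println(moduloPartition(53, 10));  // 3

    // With a dense index, 53 -> 0 and 153 -> 1, so two partitions suffice.
    Map<Integer, Integer> index = indexClusterIds(ids);
    System.out.println(indexedPartition(index, 53, 2));  // 0
    System.out.println(indexedPartition(index, 153, 2)); // 1
  }
}
```

With a dense index, k clusters spread evenly over numPartitions whatever the
raw ids look like, at the cost of having to read the cluster ids before
partitioning.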
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Matt Molek
> Assignee: Grant Ingersoll
> Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2.
> Even with k=2 only one cluster directory was created. For each reducer that
> fails to produce directories there is an empty part-r-* file in the output
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}
> bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters \
>   -o 2clusters/pca-clusters \
>   -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure \
>   -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final \
>   -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
> {noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by
> the default hadoop hash partitioner. The hashes of these two clusters aren't
> identical, but they are close. Putting both cluster names into a Text and
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
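
The hash-partitioner hypothesis in the quoted description can be checked
with a few lines of Java. Hadoop's default HashPartitioner picks a partition
as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below
applies that formula to the two hash values reported above. The class and
method names are hypothetical, and the reducer count of 2 is an assumption,
since the description does not say how many reduce tasks were used.

```java
// Hypothetical sketch; reproduces the default hash-partitioning formula
// without depending on Hadoop.
final class HashPartitionSketch {

  // Same arithmetic as Hadoop's default HashPartitioner: clear the sign
  // bit, then take the remainder by the number of reduce tasks.
  static int partitionFor(int keyHash, int numReduceTasks) {
    return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    int numReduceTasks = 2; // assumption, not stated in the report

    // Hash values reported in the issue for the two cluster names.
    int hashA = -685560454; // VL-3742464
    int hashB = -685560452; // VL-3742466

    // Both hashes are even after the sign bit is cleared, so with two
    // reduce tasks both keys fall into partition 0.
    System.out.println(partitionFor(hashA, numReduceTasks)); // 0
    System.out.println(partitionFor(hashB, numReduceTasks)); // 0
  }
}
```

Under that assumption both clusters go to the same reducer and the other
reducer receives nothing, which would be consistent with the empty part-r-*
file mentioned in the description.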