[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672836#comment-13672836
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-----------------------------------------

bq. Well yes, it is a bug. I've reproduced it on a real cluster (that's what 
led me to originally post this jira)

:-)  Yeah, just confirming it.  We get a lot of non-bugs reported.  I wonder if 
we used to just sequentially dole out cluster ids and that changed w/ the 
clustering refactoring.

{quote}That would only happen in the situation where the clusters are numbered 
1 to k or some other convenient numbering. That is rarely, if ever, the case.
The only way I could think to get this working is to temporarily remap the 
cluster ids to a more convenient numbering that would play well with the hash 
partitioner{quote}

I don't know a lot about partitioners just yet, and that makes me think they 
might be heavy-handed here, but it occurs to me that we can take advantage of 
the fact that the number of clusters is small: during setup, simply load the 
cluster id map and create the "convenient numbering" of 0 to n-1 (where n is 
the number of clusters) for writing during the reduce phase.

Then, in {code}movePartFilesToRespectiveDirectories{code}, the part files 
should get renamed appropriately.

Would that work?
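
A minimal sketch of the remapping idea, in plain Java with no Hadoop or Mahout dependencies. The class {{ClusterIdRemap}} and its method names are hypothetical and only illustrate the approach; they are not existing Mahout code. It assumes the cluster ids fit in an int and that every task can load the same set of ids during setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper (not part of Mahout): maps arbitrary cluster ids
// to the dense range 0..n-1 and back again.
final class ClusterIdRemap {
  private final Map<Integer, Integer> toDense = new HashMap<>();
  private final List<Integer> toOriginal = new ArrayList<>();

  ClusterIdRemap(Collection<Integer> clusterIds) {
    // Sort and de-duplicate so that every mapper and reducer derives
    // exactly the same numbering from the same set of ids.
    for (Integer id : new TreeSet<>(clusterIds)) {
      toDense.put(id, toOriginal.size());
      toOriginal.add(id);
    }
  }

  // Dense index to write as the key during the job.
  int denseIndex(int clusterId) {
    Integer index = toDense.get(clusterId);
    if (index == null) {
      throw new IllegalArgumentException("Unknown cluster id: " + clusterId);
    }
    return index;
  }

  // Original cluster id, for naming the output directory afterwards.
  int originalId(int denseIndex) {
    return toOriginal.get(denseIndex);
  }

  int size() {
    return toOriginal.size();
  }

  public static void main(String[] args) {
    // Cluster ids taken from the report (VL-3742464 and VL-3742466).
    ClusterIdRemap remap = new ClusterIdRemap(Arrays.asList(3742464, 3742466));
    for (int i = 0; i < remap.size(); i++) {
      System.out.println("dense " + i + " <-> cluster " + remap.originalId(i));
    }
  }
}
```

With n reducers and keys whose hash equals the dense index, each index in 0..n-1 would land in its own partition, and {{originalId}} gives the name to use when the part files are moved.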

                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Grant Ingersoll
>              Labels: clusterpp
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
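
To illustrate the partitioner theory from the description, here is a small standalone sketch. It applies the same arithmetic as Hadoop's default hash partitioner, {{(hashCode & Integer.MAX_VALUE) % numReduceTasks}}, to the two hash values quoted in the report. The class name {{HashPartitionDemo}} is hypothetical, and the reducer count of 2 (one per cluster) is an assumption about how the k=2 job was configured.

```java
// Hypothetical demo class; only the partition arithmetic mirrors Hadoop.
final class HashPartitionDemo {

  // Same formula as the default hash partitioner: clear the sign bit,
  // then take the remainder by the number of reduce tasks.
  static int partitionFor(int keyHash, int numReduceTasks) {
    return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    // Hash values quoted in the report for VL-3742464 and VL-3742466.
    int[] hashes = {-685560454, -685560452};
    int numReduceTasks = 2; // assumption: one reducer per cluster for k=2

    for (int hash : hashes) {
      System.out.println(hash + " -> partition "
          + partitionFor(hash, numReduceTasks));
    }
  }
}
```

Both hashes are even, so with two reducers both keys map to partition 0 and the other reducer receives nothing, which is consistent with the empty part-r-* file described above.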

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
