[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672760#comment-13672760
 ] 

Matt Molek edited comment on MAHOUT-1103 at 6/3/13 2:28 AM:
------------------------------------------------------------

{quote}
This seems like a flat out bug in the ClusterPP, since it says it is supposed 
to write separate directories, so it doesn't seem to me like we need to add new 
classes here, but instead should fix the bug.
{quote}

Well yes, it is a bug. I've reproduced it on a real cluster (that's what led 
me to originally post this JIRA). The problem is that the distributed clusterpp 
job assumes the hash partitioner will distribute the cluster ids one to each 
reducer. That only happens when the clusters are numbered 1 to k or in some 
other convenient way, which is rarely, if ever, the case.
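
To make it concrete, here's a quick sketch (just an illustration, not from any 
patch) of the arithmetic Hadoop's default HashPartitioner does, applied to the 
two Text hashCode values from the issue description below. With 2 reducers, 
both clusters end up on reducer 0 and the other reducer writes an empty 
part-r-* file:

{noformat}
// Sketch only: Hadoop's HashPartitioner computes
// (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
public class PartitionCheck {
  public static void main(String[] args) {
    int numReduceTasks = 2;
    int[] hashes = { -685560454 /* VL-3742464 */, -685560452 /* VL-3742466 */ };
    for (int hash : hashes) {
      int partition = (hash & Integer.MAX_VALUE) % numReduceTasks;
      System.out.println(hash + " -> reducer " + partition); // both go to reducer 0
    }
  }
}
{noformat}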

The only way I could think of to get this working is to temporarily remap the 
cluster ids to a more convenient numbering that plays well with the hash 
partitioner. See my earlier comment for the exact way I went about that. I 
don't think any small tweak will fix the current distributed code. As far as I 
can tell, you either need to temporarily change the cluster numbering or write 
some new partitioner (and I can't think of a way to do it with a partitioner). 
Maybe there's some third option, but I can't think of one.
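
For illustration, here's a rough sketch of the remapping idea (the names are 
made up for this comment, not the actual code I used): give each of the k 
final cluster ids a sequential index 0..k-1 and key the post-processing job on 
that index, so with k reducers the default hash partitioner sends exactly one 
cluster to each reducer.

{noformat}
// Illustrative sketch with made-up names; see my earlier comment for what I actually did.
// With indices 0..k-1 and k reducers, (index & Integer.MAX_VALUE) % k == index,
// so every reducer receives exactly one cluster.
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterIdRemap {
  public static Map<Integer, Integer> buildRemap(List<Integer> clusterIds) {
    List<Integer> sorted = new ArrayList<Integer>(clusterIds);
    Collections.sort(sorted);
    Map<Integer, Integer> remap = new HashMap<Integer, Integer>();
    for (int i = 0; i < sorted.size(); i++) {
      remap.put(sorted.get(i), i); // e.g. 3742464 -> 0, 3742466 -> 1
    }
    return remap;
  }
}
{noformat}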

I'm happy to try coming up with a patch for the way I've solved it, if you want 
to go about it that way.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Grant Ingersoll
>              Labels: clusterpp
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, and 2.
> Even with k=2, only one cluster directory was created. For each reducer that 
> fails to produce a directory, there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default Hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
