[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Molek updated MAHOUT-1103:
-------------------------------

    Attachment: MAHOUT-1103.patch

I've been held up by local problems running the tests. When I build Mahout
with testing enabled, I get lots of out-of-memory errors that I haven't
figured out yet. This happens on a clean checkout of trunk, so it isn't
caused by anything I've modified. It must be something odd in my local
environment.

So, apologies for not being able to fully test this. It does build with 
-DskipTests=true though, and it worked fine when testing it on some real data 
just now.

As I was typing this up, I remembered that I changed the keys from Text to
IntWritable, since int is the only type of ID a ClusterWritable can have. That
probably makes the map/reduce implementation inconsistent with the way the
sequential method does it, though. To get output identical to the sequential
method, the reducer just needs to output a Text containing the cluster id,
instead of an IntWritable containing the cluster id as it does in my patch.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Grant Ingersoll
>              Labels: clusterpp
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}
> bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
> {noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by
> the default Hadoop hash partitioner. The hashes of these two clusters aren't
> identical, but they are close. Putting each cluster name into a Text and
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
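
For anyone who wants to check the hash values quoted above without a Hadoop install, here is a minimal sketch in plain Java. It assumes that Text.hashCode() is the byte-wise 31-multiplier hash from WritableComparator.hashBytes and that the default HashPartitioner computes (hashCode & Integer.MAX_VALUE) % numReduceTasks. The class name TextHashDemo and the choice of two reduce tasks are made up for illustration; the actual number of reducers in the failing job isn't stated in this issue.

```java
import java.nio.charset.StandardCharsets;

public class TextHashDemo {

    // Re-implementation (assumption) of the hash Hadoop's Text uses:
    // start at 1, then hash = 31 * hash + byte for each UTF-8 byte.
    static int textHash(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int hash = 1;
        for (byte b : bytes) {
            hash = 31 * hash + (int) b;
        }
        return hash;
    }

    // Re-implementation (assumption) of the default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    static int partition(int hash, int numReduceTasks) {
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2; // hypothetical reducer count, not taken from the issue
        for (String id : new String[] {"VL-3742464", "VL-3742466"}) {
            int h = textHash(id);
            System.out.println(id + " -> " + h
                + " -> partition " + partition(h, reducers));
        }
    }
}
```

Under those assumptions both names hash to even numbers, so with two reduce tasks they would both land in partition 0 and the other reducer would receive no keys. This only illustrates the partitioner theory from the mailing list; it does not by itself confirm the cause of the missing directories.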

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
