[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Sean Owen (JIRA) Thu, 06 Jun 2013 12:10:13 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677389#comment-13677389
 ]


Sean Owen commented on MAHOUT-1103:
-----------------------------------

A 512MB heap can mean the JVM uses quite a bit more memory than that. Thread 
stacks, native memory allocations, buffers, and other overhead add up to 
maybe 25% more. Do you have swap off? Then everything else the OS is running 
is resident in RAM too, and that could take a lot.

There's some argument for turning the fork count down to one per core, although 
that probably leaves cores underutilized, as many JVMs will be waiting on I/O 
at any given time. With 1C, core tests take 10.5 minutes for me (2 cores, 
4 virtual cores). It's 9.3 minutes with 1.5C.
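
To put rough numbers on the memory side of this trade-off, here is a minimal 
sketch. It assumes that "1C" and "1.5C" are per-core multipliers applied to the 
4 virtual cores, that the product is truncated to a whole number of forks, and 
that each forked JVM takes its 512MB heap plus roughly 25% overhead. The class 
and method names are made up for illustration and are not part of the build.

```java
public class ForkMemoryEstimate {

    // Assumption: forks = multiplier * logical cores, truncated, never below 1.
    static int forkCount(double perCoreMultiplier, int logicalCores) {
        return Math.max(1, (int) (perCoreMultiplier * logicalCores));
    }

    // Rough resident memory across all forks: heap plus a fractional overhead
    // for thread stacks, native allocations, buffers and so on.
    static long estimatedResidentMb(int forks, int heapMb, double overheadFraction) {
        return Math.round(forks * heapMb * (1.0 + overheadFraction));
    }

    public static void main(String[] args) {
        int logicalCores = 4;   // "2 cores, 4 virtual cores"
        int heapMb = 512;       // heap per forked JVM
        double overhead = 0.25; // "maybe 25% more"

        for (double multiplier : new double[] {1.0, 1.5}) {
            int forks = forkCount(multiplier, logicalCores);
            long totalMb = estimatedResidentMb(forks, heapMb, overhead);
            System.out.println(multiplier + "C: " + forks + " forks, ~" + totalMb + " MB");
        }
    }
}
```

Under those assumptions, 1C comes to 4 forks and about 2560 MB, and 1.5C to 
6 forks and about 3840 MB, before counting anything else resident on the 
machine.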

Hmm, is it better to turn this down to make sure the tests run out of the box 
for more people, and leave it to those with big machines to manually tune this 
upwards? Or vice versa? The speed difference is about 15%. The number of people 
for whom it would fail now is... I don't know. It failed for 3 of us before my 
change, now down to 1. Might be reasonable to think 5-10% of users won't be 
able to run the tests as is?
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Grant Ingersoll
>              Labels: clusterpp
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}
> bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
> {noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default Hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
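
For reference, the hash partitioner theory can be checked with a few lines of 
arithmetic. Hadoop's default HashPartitioner picks a reducer as 
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The sketch below applies 
that formula to the two hash values reported above, without depending on 
Hadoop. It assumes that the job is keyed by cluster id and runs with 2 reduce 
tasks for k=2; neither is confirmed in the report. The class name is made up 
for illustration.

```java
public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.getPartition().
    static int partition(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Assumption: one reduce task per cluster, so 2 for the k=2 run.
        int numReduceTasks = 2;

        // Hash values copied from the issue description, not recomputed here.
        int hashOfFirstCluster = -685560454;  // VL-3742464
        int hashOfSecondCluster = -685560452; // VL-3742466

        System.out.println("VL-3742464 -> partition "
                + partition(hashOfFirstCluster, numReduceTasks));
        System.out.println("VL-3742466 -> partition "
                + partition(hashOfSecondCluster, numReduceTasks));
    }
}
```

Both keys map to partition 0, because both masked hashes are even. Under the 
assumptions above, the second reducer would receive no input, which would be 
consistent with the single cluster directory and the empty part-r-* file 
described for k=2. It does not by itself prove that this is the cause.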

