[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673119#comment-13673119
 ] 

Matt Molek commented on MAHOUT-1103:
------------------------------------

{quote}
Yeah, just confirming it. We get a lot of non-bugs reported. I wonder if we 
used to just sequentially dole out cluster ids and that changed w/ the 
clustering refactoring.
{quote}

Sorry, no problem. I just wanted to make sure nothing was getting overlooked 
since this thread is getting rather long.

{quote}
which is in the mappers and reducers, we need to load up the cluster ids and 
just map it there.
{quote}

That's exactly what I've gotten working on my own project. I haven't submitted 
it here as a patch yet because my first version was only a proof of concept and 
doesn't follow Mahout's code style. I think the patch currently attached is a 
different, earlier one from Paritosh. My code is pretty simple. I can submit a 
patch in the next couple of days once I find a little free time.
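
A minimal sketch of the idea quoted above, for illustration only. It is not the code from the project mentioned here, nor the attached patch. It assumes the numeric cluster ids are already known (for example, read from the final clusters directory), and the class and method names are hypothetical. Each cluster id is given a sequential index, and that index, rather than the key's hash, picks the reducer.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper: maps sparse cluster ids to sequential indices 0..k-1.
public class ClusterIdIndex {

    // Sort the ids so every mapper and reducer derives the same mapping.
    static Map<Integer, Integer> buildIndex(Collection<Integer> clusterIds) {
        Map<Integer, Integer> index = new HashMap<>();
        int next = 0;
        for (int id : new TreeSet<>(clusterIds)) {
            index.put(id, next++);
        }
        return index;
    }

    // Choose a reducer from the sequential index instead of the key's hash.
    static int partition(Map<Integer, Integer> index, int clusterId, int numReduceTasks) {
        return index.get(clusterId) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Ids taken from the cluster names VL-3742464 and VL-3742466 in the report.
        Map<Integer, Integer> index = buildIndex(Arrays.asList(3742464, 3742466));
        int numReduceTasks = 2; // assumption: one reducer per cluster
        for (int id : new int[] {3742464, 3742466}) {
            System.out.println("VL-" + id + " -> reducer " + partition(index, id, numReduceTasks));
        }
    }
}
```

With as many reducers as clusters, each cluster gets its own reducer under this scheme, whatever the raw ids happen to be.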

{quote}
Matt, out of curiosity, what's your use case for the clusterpp? Robin Anil and 
I are both looking at this code and wondering why it is useful to separate out 
the clusters into their own directory.
{quote}

For me, the value in separating the clusters out into their own directories is 
that it makes it very easy to launch further Mahout jobs against the contents of 
an individual cluster. I cluster, separate the results, and then launch new 
jobs against each clusterpp output directory. I find it pretty useful.


                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Grant Ingersoll
>              Labels: clusterpp
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default Hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
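
To make the partitioner suspicion concrete, the sketch below applies the same arithmetic as Hadoop's default HashPartitioner, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, to the two hash values reported above. The reducer count of 2 is an assumption, since the report does not state it, and the class name is hypothetical. The hash values are copied from the report rather than recomputed, so the example needs no Hadoop dependency.

```java
// Hypothetical demo class; only the partition formula mirrors Hadoop's HashPartitioner.
public class HashPartitionDemo {

    // Mask off the sign bit, then take the remainder by the number of reducers.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // assumption: not stated in the report
        String[] names = {"VL-3742464", "VL-3742466"};
        int[] hashes = {-685560454, -685560452}; // Text.hashCode() values from the report
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " -> partition "
                    + partition(hashes[i], numReduceTasks));
        }
    }
}
```

Under that assumption both keys fall into partition 0, because both masked hash values are even. The second reducer then receives no keys, which would match the empty part-r-* file described above.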

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
