[
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201252#comment-13201252
]
Gaurav Redkar commented on MAHOUT-966:
--------------------------------------
Hello,
As Paritosh suggested, I tried specifying the -cl option while clustering, but
I am still experiencing the same problem. The number of members printed by the
clusterdumper code matches the number of points generated by the
ClusterOutputPostProcessor for each cluster. Unfortunately, this number does
not match the value 'n' reported for that cluster by clusterdumper.
Also, while running the algorithm on a different dataset, the clustering
algorithm produced two clusters with the same cluster identifier, and that
cluster contained some of the points twice. Any idea why this is happening?
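For anyone who wants to reproduce the checks described above, here is a
minimal sketch. It is not Mahout code: the function name and both inputs are
hypothetical stand-ins for values parsed out of the clusterdumper output (the
reported 'n' per cluster) and the ClusterOutputPostProcessor output (the points
found per cluster directory). It reports count mismatches, repeated cluster
identifiers, and points listed more than once in a cluster.

```python
from collections import Counter


def check_clusters(reported_n, members):
    """Compare reported cluster sizes with the points actually listed.

    reported_n: dict mapping cluster id -> reported n (hypothetical input,
                e.g. parsed from clusterdumper output).
    members:    list of (cluster id, list of point ids) pairs, one pair per
                cluster directory. A list is used instead of a dict so that
                a repeated cluster id stays visible.
    Returns a list of human-readable problem descriptions.
    """
    problems = []

    # Flag cluster identifiers that occur more than once.
    id_counts = Counter(cid for cid, _ in members)
    for cid, count in id_counts.items():
        if count > 1:
            problems.append(f"cluster id {cid} appears {count} times")

    for cid, points in members:
        # Flag a mismatch between the reported n and the actual point count.
        expected = reported_n.get(cid)
        if expected is not None and expected != len(points):
            problems.append(f"{cid}: n={expected} but {len(points)} points found")

        # Flag points that are listed more than once in the same cluster.
        repeated = sorted(p for p, c in Counter(points).items() if c > 1)
        if repeated:
            problems.append(f"{cid}: repeated points {repeated}")

    return problems


if __name__ == "__main__":
    # Made-up toy data, not taken from the attached dataset.
    reported = {"A": 3, "B": 2}
    found = [("A", [1, 2]), ("B", [7, 7])]
    for line in check_clusters(reported, found):
        print(line)
```

Keeping the members as a list of pairs rather than a dictionary is deliberate:
a dictionary would silently merge two clusters that share an identifier.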
The command used for the clustering job is:
bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job \
  -x 15 -cd 5 -t1 100 -t2 30 -cl \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -i testdata -ow -o output
I am attaching the dataset on which I tried the clustering. Kindly share your
suggestions on it.
> Mismatch in the number of points given by the clusterDumper and
> ClusterOutputPostProcessor
> -------------------------------------------------------------------------------------------
>
> Key: MAHOUT-966
> URL: https://issues.apache.org/jira/browse/MAHOUT-966
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Affects Versions: 0.6
> Environment: hadoop 0.20.2 mahout 0.6
> Reporter: Gaurav Redkar
> Priority: Minor
> Attachments: points100dCCNorm.txt
>
>
> After running the post processor, the number of points that each cluster
> contains does not match the number of points each cluster should contain, as
> stated by clusterdumper.
>
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n given in clusters-n-final for each cluster differs from the number of
> points actually contained in the directory for each cluster. Any idea why
> this is happening?