[
https://issues.apache.org/jira/browse/MAHOUT-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on MAHOUT-1330 started by Suneel Marthi.
> Unable to do K-means clustering on Reuters dataset
> --------------------------------------------------
>
> Key: MAHOUT-1330
> URL: https://issues.apache.org/jira/browse/MAHOUT-1330
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.8
> Environment: Linux
> Reporter: Karthik Prakhya
> Assignee: Suneel Marthi
> Fix For: 0.9
>
> Attachments: MyAnalyzer.java, NewsKMeansClustering-output.txt,
> NewsKMeansClustering.java, df-count.txt, frequency-file.txt,
> reuters-seqfiles.zipx, test-kmeans-clustering-reuters-java-api.sh,
> tfidf-vectors.txt
>
>
> The attached code uses the Mahout API to do k-means clustering on the Reuters
> dataset and generates the initial centroids using the canopy algorithm. The
> parameters are exactly the same as the ones in the Scala example presented in
> the following link:
> http://sujitpal.blogspot.com/2012/09/learning-mahout-clustering.html
> The code compiles without error, but the K-means algorithm cannot start
> because the initial centroids are not generated. This in turn is because the
> TF-IDF vectors are not generated.
> Since this code compiles and is based on earlier Scala code that worked, this
> suggests that there may be a bug in the Mahout source code that needs
> fixing. I thought I should bring it to your attention.
> I have attached the source code, the included JAR files and the shell script
> (called test-kmeans-clustering-reuters-java-api.sh) to compile and run the
> code. The output of the shell script is located in
> NewsKMeansClustering-output.txt. Please note that you may need to change the
> path to the JAR files in the shell script (see the environment variable
> JARPATH) based on where you put the JARs. I have also attached the output of
> the clusterdump utility as .txt files for the intermediate outputs of my
> code, such as the TF vectors and TF-IDF vectors (see tf-vectors.txt,
> tfidf-vectors.txt, df-count.txt and frequency-file.txt).
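
The failure chain described in the report (no TF-IDF vectors, hence no canopy
centroids, hence no K-means run) can be narrowed down by checking each stage's
output directory in order. The following is a minimal sketch in plain Java and
is not part of the attached code. The class name PipelineStageCheck, the
directory names tf-vectors, tfidf-vectors and canopy-centroids, and the
assumption that the outputs sit on the local filesystem as part-* files are all
illustrative guesses. It only detects missing or zero-byte output; a
SequenceFile that has a header but no records would still pass.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class PipelineStageCheck {

    /**
     * Returns true if dir exists and contains at least one non-empty
     * regular file whose name starts with "part-" (searched recursively).
     */
    public static boolean hasOutput(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> files = Files.walk(dir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .anyMatch(p -> {
                    try {
                        return Files.size(p) > 0;
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Walks the stages in pipeline order and returns the first one whose
     * output directory is missing or empty, or Optional.empty() if all
     * stages produced output.
     */
    public static Optional<String> firstEmptyStage(Path root, List<String> stages) {
        for (String stage : stages) {
            if (!hasOutput(root.resolve(stage))) {
                return Optional.of(stage);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical layout: adjust the root and stage names to your job.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> stages =
            Arrays.asList("tf-vectors", "tfidf-vectors", "canopy-centroids");

        Optional<String> failed = firstEmptyStage(root, stages);
        if (failed.isPresent()) {
            System.out.println("First stage without output: " + failed.get());
        } else {
            System.out.println("All stages produced output.");
        }
    }
}
```

Run it with the job's working directory as the only argument. If the first
stage reported is the TF-IDF one, the problem lies upstream of both canopy and
K-means, which matches the symptom described in the report.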
--
This message was sent by Atlassian JIRA
(v6.1#6144)