[ 
https://issues.apache.org/jira/browse/MAHOUT-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1330 started by Suneel Marthi.

> Unable to do K-means clustering on Reuters dataset
> --------------------------------------------------
>
>                 Key: MAHOUT-1330
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1330
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>         Environment: Linux
>            Reporter: Karthik Prakhya
>            Assignee: Suneel Marthi
>             Fix For: 0.9
>
>         Attachments: MyAnalyzer.java, NewsKMeansClustering-output.txt, 
> NewsKMeansClustering.java, df-count.txt, frequency-file.txt, 
> reuters-seqfiles.zipx, test-kmeans-clustering-reuters-java-api.sh, 
> tfidf-vectors.txt
>
>
> The attached code uses the Mahout API to do k-means clustering on the Reuters 
> dataset and generates the initial centroids using the canopy algorithm. The 
> parameters are exactly the same as the ones in the Scala example presented in 
> the following link:
> http://sujitpal.blogspot.com/2012/09/learning-mahout-clustering.html
> The code compiles without errors, but the K-means algorithm cannot start
> because the initial centroids are not being generated. This in turn happens
> because the TF-IDF vectors are not being generated.
> Since this code compiles and is based on earlier Scala code that worked,
> this suggests a bug in the Mahout source code that may need fixing. I
> thought I should bring it to your attention.
> I have attached the source code, the included JAR files and the shell script
> (called test-kmeans-clustering-reuters-java-api.sh) to compile and run the
> code. The output of the shell script is located in
> NewsKMeansClustering-output.txt. Please note that you may need to change the
> path to the JAR files in the shell script (see the environment variable
> JARPATH) based on where you put the JARs. I also attached the output of the
> clusterdump utility in the form of .txt files for the intermediate outputs of
> my code such as the TF vectors and TF-IDF vectors (see tf-vectors.txt,
> tfidf-vectors.txt, df-count.txt and frequency-file.txt).
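
To illustrate the dependency described in the report (k-means needs initial
centroids, and canopy generation needs input vectors), here is a minimal,
self-contained Java sketch of the textbook canopy procedure. It is not
Mahout's implementation and does not use the Mahout API. The class name
CanopySketch, the method canopyCenters, the Euclidean distance measure, and
the thresholds T1 = 3.0 and T2 = 1.5 are all assumptions chosen for the
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not the code attached to MAHOUT-1330.
final class CanopySketch {

    // Euclidean distance between two dense vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns one center per canopy. t1 is the loose threshold (membership),
    // t2 is the tight threshold (removal from the candidate list); t1 > t2.
    static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // The next unclaimed point seeds a new canopy.
            double[] seed = remaining.remove(0);
            double[] sum = seed.clone();
            int count = 1;
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(seed, p);
                if (d < t1) {
                    // Within the loose threshold: the point joins this canopy.
                    for (int i = 0; i < sum.length; i++) {
                        sum[i] += p[i];
                    }
                    count++;
                }
                if (d < t2) {
                    // Within the tight threshold: it cannot seed another canopy.
                    it.remove();
                }
            }
            // The canopy center is the mean of its members.
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= count;
            }
            centers.add(sum);
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0.0, 0.0}, new double[] {1.0, 0.0},
            new double[] {10.0, 10.0}, new double[] {11.0, 10.0});
        for (double[] c : canopyCenters(points, 3.0, 1.5)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

With the four sample points the sketch yields two centers, and with an empty
input list it yields none. The second case mirrors the reported symptom: if
no TF-IDF vectors are produced, there is nothing to seed K-means with.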



--
This message was sent by Atlassian JIRA
(v6.1#6144)
