Hi everyone,

Small update on Streaming KMeans: a first version of the MapReduce code is ready in [1] (the mapreduce branch). It seems to work on synthetic data, so I'm now starting to look at real data.
I talked with Ted, and we decided that the first thing to test on is clustering the 20 newsgroups data set. I used [2] (seqdirectory and seq2sparse) to create a sequence file of (Text, Text) entries, one per file in the data set. Using DictionaryVectorizer with TF-IDF weighting seemed reasonable (although Ted suggested using AdaptiveValueEncoder to create the vectors). The resulting vectors have more than 90,000 features, so I projected each one onto 50 uniformly distributed, normalized random vectors, producing a 50-dimensional vector per document (a sketch of this step is at the end of this mail). I then ran Streaming KMeans and clustered the resulting centroids with Ball KMeans down to 20 clusters (the actual number of groups). That code is here [3].

The problem is that the resulting clusters (after Ball KMeans) are extremely uneven, and the final clustering is very wrong: one final cluster has more than 5,000 points, even though the largest real cluster only has about 1,000. I realize I haven't stripped the headers from the documents, but I doubt that alone causes the brokenness I'm seeing.

Are these steps reasonable? If so, which of the k-means steps is most likely at fault? Or am I doing something illogical? Thanks! :)

[1] https://github.com/dfilimon/knn/tree/mapreduce/src/main/java/org/apache/mahout/knn/experimental
[2] https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html
[3] https://github.com/dfilimon/knn/blob/mapreduce/src/test/java/org/apache/mahout/knn/experimental/EvaluateClustering.java
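P.S. In case it helps to review, here is roughly what the projection step looks like. This is a minimal sketch using the core mahout-math API, not the actual code from [3]; the class and method names (RandomProjectionSketch, randomBasis, project) and the originalDim/reducedDim parameters are just illustrative, and I'm assuming the document vectors come from the seq2sparse output.

import java.util.Random;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class RandomProjectionSketch {
  /**
   * Builds reducedDim random directions in the original space. Sampling
   * Gaussian coordinates and normalizing gives directions uniformly
   * distributed on the unit sphere.
   */
  public static Vector[] randomBasis(int originalDim, int reducedDim, long seed) {
    Random rand = new Random(seed);
    Vector[] basis = new Vector[reducedDim];
    for (int i = 0; i < reducedDim; i++) {
      Vector r = new DenseVector(originalDim);
      for (int j = 0; j < originalDim; j++) {
        r.setQuick(j, rand.nextGaussian());
      }
      basis[i] = r.normalize();
    }
    return basis;
  }

  /**
   * Projects a (sparse, ~90,000-feature) TF-IDF document vector down to
   * reducedDim dimensions by dotting it with each basis vector.
   */
  public static Vector project(Vector doc, Vector[] basis) {
    Vector projected = new DenseVector(basis.length);
    for (int i = 0; i < basis.length; i++) {
      projected.setQuick(i, basis[i].dot(doc));
    }
    return projected;
  }
}

This is only the general shape of the step; the actual code is in [3]. One thing I want to double-check here is whether 50 random directions preserve enough structure of a >90,000-feature space, since if the projection loses too much, both k-means steps would look broken even if they're correct.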
