I updated the code to compare the ball k-means clusters to the ones generated using streaming k-means + ball k-means. They're both wrong! :(
This is the updated version [1]. And the output is:

Total number of vectors 18828
Clustering with BallKMeans
0: 3978
1: 5
2: 1859
3: 31
4: 102
5: 14
6: 17
7: 382
8: 1
9: 2946
10: 1096
11: 21
12: 12
13: 4128
14: 70
15: 148
17: 237
16: 3139
19: 634
18: 8
Clustering with StreamingKMeans
0: 2
1: 10711
2: 149
3: 1338
4: 19
5: 17
6: 5568
7: 12
8: 1
9: 10
10: 9
11: 32
12: 20
13: 439
14: 81
15: 3
17: 165
16: 194
19: 55
18: 3

Not well distributed at all...

[1] https://github.com/dfilimon/knn/blob/57aad6b0695782d06a0f1a989aca7b243420f611/src/test/java/org/apache/mahout/knn/experimental/EvaluateClustering.java

On Tue, Nov 27, 2012 at 12:22 PM, Dan Filimon <[email protected]> wrote:
> Hi everyone,
>
> Small update about Streaming KMeans: a first version of the map reduce
> code is ready in [1] (the mapreduce branch).
> It seems to be working for synthetic data so I'm now starting to look
> at real data.
>
> So, I talked with Ted and we decided the first thing to test on is
> clustering the 20 newsgroups data set.
> I used [2] (seqdirectory and seq2sparse) to create a sequence file
> (Text, Text) with entries for each file in the 20 newsgroups data set.
> It seemed reasonable to use DictionaryVectorized (although Ted
> suggested using AdaptiveValueEncoder for creating the vectors) for
> TF-IDF for scoring.
>
> The vectors I get have > 90000 features.
> So, I projected these vectors on to 50 uniformly distributed
> normalized vectors. So, I create a dimension 50 vector for each
> document.
>
> I then ran Streaming KMeans and then clustered the resulting centroids
> with Ball KMeans down to 20 groups (the actual number of clusters).
> That code is here [3].
>
> The thing is the resulting clusters (after KMeans) are extremely
> uneven and the final clusters are very wrong (there is a final cluster
> with > 5000 points) even though in reality the maximum real cluster is
> of size 1000.
>
> I now realized that I haven't stripped the data of the headers, but I
> doubt this might cause the brokenness I'm seeing.
>
> Are the steps reasonable? In which case one of the k-means steps is
> probably wrong? Or am I doing something illogical?
>
> Thanks! :)
>
> [1]
> https://github.com/dfilimon/knn/tree/mapreduce/src/main/java/org/apache/mahout/knn/experimental
> [2]
> https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html
> [3]
> https://github.com/dfilimon/knn/blob/mapreduce/src/test/java/org/apache/mahout/knn/experimental/EvaluateClustering.java
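
P.S. In case it helps anyone sanity-check the projection step from the quoted message, here is a rough, standalone sketch of what "projecting onto 50 normalized random vectors" could look like. It is not the code in the repository: the class name RandomProjectionSketch and both method names are made up for illustration, it uses plain Java arrays and maps instead of Mahout vectors, and it assumes the random directions are drawn by sampling Gaussian entries and normalizing each vector to unit length.

```java
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of Mahout or the knn repository.
class RandomProjectionSketch {

    // Builds numProjections random unit vectors of the given dimension.
    // Assumption: Gaussian entries followed by L2 normalization.
    static double[][] randomBasis(int numProjections, int dimension, long seed) {
        Random rng = new Random(seed);
        double[][] basis = new double[numProjections][dimension];
        for (double[] row : basis) {
            double squaredNorm = 0;
            for (int j = 0; j < dimension; j++) {
                row[j] = rng.nextGaussian();
                squaredNorm += row[j] * row[j];
            }
            double norm = Math.sqrt(squaredNorm);
            for (int j = 0; j < dimension; j++) {
                row[j] /= norm;
            }
        }
        return basis;
    }

    // Projects a sparse document vector (feature index -> weight) onto each
    // basis vector, giving one dense coordinate per basis vector.
    static double[] project(Map<Integer, Double> sparse, double[][] basis) {
        double[] projected = new double[basis.length];
        for (int i = 0; i < basis.length; i++) {
            double dot = 0;
            for (Map.Entry<Integer, Double> entry : sparse.entrySet()) {
                dot += basis[i][entry.getKey()] * entry.getValue();
            }
            projected[i] = dot;
        }
        return projected;
    }
}
```

With something like randomBasis(50, numFeatures, seed) built once and project(...) applied to every TF-IDF vector, each document ends up as a dimension 50 vector, which is what then goes into the clustering steps. If the real code differs from this in how the basis is drawn or normalized, that might be one place to look.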
