Hi everyone,

Small update about Streaming KMeans: a first version of the MapReduce
code is ready in [1] (the mapreduce branch).
It seems to be working on synthetic data, so I'm now starting to look
at real data.

So, I talked with Ted and we decided that the first thing to test on
is clustering the 20 newsgroups data set.
I used [2] (seqdirectory and seq2sparse) to create a sequence file
(Text, Text) with one entry per file in the 20 newsgroups data set.
It seemed reasonable to use DictionaryVectorizer with TF-IDF weighting
(although Ted suggested using AdaptiveValueEncoder to create the
vectors instead).
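
For reference, the two steps from [2] can also be driven from Java;
this is just a sketch (the paths are placeholders of mine, and the
flags mirror the command-line quick tour):

import org.apache.mahout.text.SequenceFilesFromDirectory;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class Vectorize20News {
  public static void main(String[] args) throws Exception {
    // One (Text, Text) entry per posting: key = file name, value = contents.
    SequenceFilesFromDirectory.main(new String[] {
        "--input", "20news/raw", "--output", "20news/seqfiles",
        "--charset", "UTF-8"});
    // Tokenize, build the dictionary and emit TF-IDF weighted vectors.
    SparseVectorsFromSequenceFiles.main(new String[] {
        "--input", "20news/seqfiles", "--output", "20news/vectors",
        "--weight", "tfidf"});
  }
}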

The vectors I get have > 90,000 features, so I projected them onto 50
uniformly distributed, normalized random vectors, producing a
50-dimensional vector for each document.
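
The projection itself is just 50 dot products per document; a minimal
sketch (this class is mine for illustration, not code from the
branch):

import java.util.Random;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class RandomProjection {
  // Build reducedDim random vectors with uniformly distributed
  // components, each normalized to unit length.
  public static Vector[] randomBasis(int originalDim, int reducedDim,
                                     long seed) {
    Random random = new Random(seed);
    Vector[] basis = new Vector[reducedDim];
    for (int i = 0; i < reducedDim; ++i) {
      Vector v = new DenseVector(originalDim);
      for (int j = 0; j < originalDim; ++j) {
        v.set(j, random.nextDouble() - 0.5);
      }
      basis[i] = v.normalize();
    }
    return basis;
  }

  // Project a (sparse) TF-IDF document vector down to reducedDim
  // dimensions: one dot product per basis vector.
  public static Vector project(Vector document, Vector[] basis) {
    Vector projected = new DenseVector(basis.length);
    for (int i = 0; i < basis.length; ++i) {
      projected.set(i, document.dot(basis[i]));
    }
    return projected;
  }
}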

I then ran Streaming KMeans and clustered the resulting centroids
with Ball KMeans down to 20 clusters (the actual number of
newsgroups). That code is here [3].
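
Roughly, the two-pass pipeline looks like this (a from-memory sketch
of the knn classes; the searcher choice and the exact constructor
arguments may not match what's in the branch):

import java.util.List;

import com.google.common.collect.Lists;
import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
import org.apache.mahout.knn.cluster.BallKMeans;
import org.apache.mahout.knn.cluster.StreamingKMeans;
import org.apache.mahout.knn.search.BruteSearch;
import org.apache.mahout.knn.search.UpdatableSearcher;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.Vector;

public class TwoPassClustering {
  public static UpdatableSearcher cluster(Iterable<Centroid> docs) {
    // Pass 1: Streaming KMeans builds a sketch of the data with many
    // more centroids than the final k (the numeric arguments here
    // are placeholders).
    StreamingKMeans streaming = new StreamingKMeans(
        new BruteSearch(new SquaredEuclideanDistanceMeasure()),
        /* estimated clusters */ 200, /* initial cutoff */ 1e-6);
    UpdatableSearcher sketch = streaming.cluster(docs);

    // Pass 2: Ball KMeans clusters the weighted intermediate
    // centroids down to the 20 real groups.
    List<Centroid> intermediate = Lists.newArrayList();
    for (Vector v : sketch) {
      intermediate.add((Centroid) v);
    }
    BallKMeans ball = new BallKMeans(
        new BruteSearch(new SquaredEuclideanDistanceMeasure()),
        /* final clusters */ 20, /* max iterations */ 100);
    return ball.cluster(intermediate);
  }
}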

The problem is that the resulting clusters (after Ball KMeans) are
extremely uneven and the final result is very wrong: there is one
final cluster with > 5,000 points, even though the largest real group
has only about 1,000 documents.

I've now realized that I haven't stripped the headers from the
messages, but I doubt that alone causes the brokenness I'm seeing.
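
(Stripping them should just mean dropping everything up to the first
blank line of each posting, since the files are RFC 822-style;
something like this sketch:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StripHeaders {
  // Returns the body of a 20 newsgroups posting: everything after
  // the first blank line (the headers come before it).
  public static String body(String path) throws IOException {
    StringBuilder body = new StringBuilder();
    boolean inBody = false;
    BufferedReader reader = new BufferedReader(new FileReader(path));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (inBody) {
          body.append(line).append('\n');
        } else if (line.isEmpty()) {
          inBody = true;
        }
      }
    } finally {
      reader.close();
    }
    return body.toString();
  }
}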

Do these steps seem reasonable? If so, which of the k-means steps is
most likely wrong? Or am I doing something illogical?

Thanks! :)

[1] https://github.com/dfilimon/knn/tree/mapreduce/src/main/java/org/apache/mahout/knn/experimental
[2] https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html
[3] https://github.com/dfilimon/knn/blob/mapreduce/src/test/java/org/apache/mahout/knn/experimental/EvaluateClustering.java
