Re: Streaming KMeans 20newsgroups clustering

Dan Filimon Tue, 27 Nov 2012 05:43:40 -0800

I updated the code to compare the ball k-means clusters to the ones
generated using streaming k-means + ball k-means.
They're both wrong! :(


This is the updated version [1].

And the output is:

Total number of vectors 18828
Clustering with BallKMeans
0: 3978
1: 5
2: 1859
3: 31
4: 102
5: 14
6: 17
7: 382
8: 1
9: 2946
10: 1096
11: 21
12: 12
13: 4128
14: 70
15: 148
17: 237
16: 3139
19: 634
18: 8
Clustering with StreamingKMeans
0: 2
1: 10711
2: 149
3: 1338
4: 19
5: 17
6: 5568
7: 12
8: 1
9: 10
10: 9
11: 32
12: 20
13: 439
14: 81
15: 3
17: 165
16: 194
19: 55
18: 3

Not well distributed at all...
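
To put a number on how uneven this is, here is a rough sketch in plain Java (not part of EvaluateClustering; the class and method names are made up for illustration). It takes the cluster sizes printed above, listed in cluster-id order, and computes the share of points in the largest cluster plus a normalized entropy, where 1.0 would mean perfectly even clusters.

```java
public class ClusterSkew {
    // Cluster sizes copied from the output above, ordered by cluster id 0..19.
    static final int[] BALL_KMEANS = {
        3978, 5, 1859, 31, 102, 14, 17, 382, 1, 2946,
        1096, 21, 12, 4128, 70, 148, 3139, 237, 8, 634};
    static final int[] STREAMING_KMEANS = {
        2, 10711, 149, 1338, 19, 17, 5568, 12, 1, 10,
        9, 32, 20, 439, 81, 3, 194, 165, 3, 55};

    // Total number of points assigned across all clusters.
    static int total(int[] sizes) {
        int sum = 0;
        for (int size : sizes) {
            sum += size;
        }
        return sum;
    }

    // Fraction of all points that fall into the single largest cluster.
    static double largestShare(int[] sizes) {
        int max = 0;
        for (int size : sizes) {
            max = Math.max(max, size);
        }
        return (double) max / total(sizes);
    }

    // Shannon entropy of the size distribution divided by log(k),
    // so that k equally sized clusters score 1.0.
    static double normalizedEntropy(int[] sizes) {
        double n = total(sizes);
        double entropy = 0;
        for (int size : sizes) {
            if (size > 0) {
                double p = size / n;
                entropy -= p * Math.log(p);
            }
        }
        return entropy / Math.log(sizes.length);
    }

    public static void main(String[] args) {
        System.out.printf("BallKMeans: largest share %.3f, entropy %.3f%n",
            largestShare(BALL_KMEANS), normalizedEntropy(BALL_KMEANS));
        System.out.printf("StreamingKMeans: largest share %.3f, entropy %.3f%n",
            largestShare(STREAMING_KMEANS), normalizedEntropy(STREAMING_KMEANS));
    }
}
```

With 20 even clusters the largest share would be 5%. Here the largest cluster holds roughly 22% of the points for BallKMeans and roughly 57% for StreamingKMeans.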

[1] 
https://github.com/dfilimon/knn/blob/57aad6b0695782d06a0f1a989aca7b243420f611/src/test/java/org/apache/mahout/knn/experimental/EvaluateClustering.java

On Tue, Nov 27, 2012 at 12:22 PM, Dan Filimon
<[email protected]> wrote:
> Hi everyone,
>
> Small update about Streaming KMeans: a first version of the map reduce
> code is ready in [1] (the mapreduce branch).
> It seems to be working for synthetic data so I'm now starting to look
> at real data.
>
> So, I talked with Ted and we decided the first thing to test on is
> clustering the 20 newsgroups data set.
> I used [2] (seqdirectory and seq2sparse) to create a sequence file
> (Text, Text) with entries for each file in the 20 newsgroups data set.
> It seemed reasonable to use DictionaryVectorizer (although Ted
> suggested using AdaptiveValueEncoder for creating the vectors) with
> TF-IDF for scoring.
>
> The vectors I get have more than 90000 features, so I projected
> them onto 50 uniformly distributed, normalized vectors. This gives
> a 50-dimensional vector for each document.
>
> I then ran Streaming KMeans and then clustered the resulting centroids
> with Ball KMeans down to 20 groups (the actual number of clusters).
> That code is here [3].
>
> The problem is that the resulting clusters (after KMeans) are
> extremely uneven and the final clusters are very wrong: there is a
> final cluster with > 5000 points, even though the largest real
> cluster is of size 1000.
>
> I just realized that I haven't stripped the headers from the data,
> but I doubt that would cause the brokenness I'm seeing.
>
> Are the steps reasonable? If so, one of the k-means steps is
> probably wrong. Or am I doing something illogical?
>
> Thanks! :)
>
> [1] 
> https://github.com/dfilimon/knn/tree/mapreduce/src/main/java/org/apache/mahout/knn/experimental
> [2] 
> https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html
> [3] 
> https://github.com/dfilimon/knn/blob/mapreduce/src/test/java/org/apache/mahout/knn/experimental/EvaluateClustering.java
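
For reference, the projection step described in the quoted message (mapping each TF-IDF vector with more than 90000 features onto 50 random unit vectors) could be sketched in plain Java roughly as follows. This is only an illustration and not the code in [3]: the class and method names are hypothetical, the sparse vector is represented as a plain Map rather than a Mahout Vector, and drawing Gaussian components and then normalizing is one assumed way to get directions that are uniformly distributed on the unit sphere.

```java
import java.util.Map;
import java.util.Random;

public class RandomProjectionSketch {
    // Builds numProjections random unit vectors, each of length numFeatures.
    // Assumption: Gaussian components followed by normalization, which gives
    // directions uniformly distributed on the unit sphere.
    static double[][] randomBasis(int numProjections, int numFeatures, long seed) {
        Random random = new Random(seed);
        double[][] basis = new double[numProjections][numFeatures];
        for (double[] row : basis) {
            double sumOfSquares = 0;
            for (int j = 0; j < numFeatures; j++) {
                row[j] = random.nextGaussian();
                sumOfSquares += row[j] * row[j];
            }
            double norm = Math.sqrt(sumOfSquares);
            for (int j = 0; j < numFeatures; j++) {
                row[j] /= norm;
            }
        }
        return basis;
    }

    // Projects a sparse document vector (feature index -> weight) onto every
    // basis vector; the result has one coordinate per basis vector.
    static double[] project(Map<Integer, Double> sparseVector, double[][] basis) {
        double[] projected = new double[basis.length];
        for (int i = 0; i < basis.length; i++) {
            double dot = 0;
            for (Map.Entry<Integer, Double> entry : sparseVector.entrySet()) {
                dot += basis[i][entry.getKey()] * entry.getValue();
            }
            projected[i] = dot;
        }
        return projected;
    }
}
```

Calling randomBasis(50, numFeatures, seed) once and then project(...) per document yields one 50-dimensional vector per document, which is what gets fed to the clustering step.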
