Running k-means in R with the projected 50-dimensional vectors gets me
the following sizes for the 20 clusters:

K-means clustering with 20 clusters of sizes 140, 1195, 228, 3081,
2162, 462, 31, 329, 14, 936, 2602, 32, 32, 587, 105, 1662, 2124, 66,
78, 2962

I guess projecting them might be the issue... (this is for 50 iterations).
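
For reference, here is a rough sketch of the three checks Ted suggests in
the quoted message below: cosine normalization, the extra 1-of-n label
dimensions, and the sum of squared distances. It is written in Python with
NumPy purely for illustration and is not the Mahout or R code used in this
thread. The function names, the -1 marker for unlabeled test rows, and the
weight parameter are all assumptions of the sketch.

```python
import numpy as np


def cosine_normalize(x):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0.0, 1.0, norms)


def add_label_hint(x, labels, num_classes, weight=1.0):
    """Append a 1-of-n block of num_classes columns to each row.

    labels holds the class index for training rows and -1 for test rows
    (an assumed convention); test rows get all zeros in the new columns.
    weight is a hypothetical knob for how strong the hint is.
    """
    labels = np.asarray(labels)
    hint = np.zeros((x.shape[0], num_classes))
    rows = np.flatnonzero(labels >= 0)
    hint[rows, labels[rows]] = weight
    return np.hstack([x, hint])


def sum_squared_distance(x, centroids, assignment):
    """Total squared Euclidean distance from each row to its assigned centroid."""
    diff = x - centroids[np.asarray(assignment)]
    return float(np.sum(diff * diff))
```

With 50 projected dimensions and 20 newsgroups, add_label_hint would return
70-dimensional rows; the clustering itself is then run on those rows with
whatever k-means implementation is being tested.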

On Tue, Nov 27, 2012 at 4:29 PM, Ted Dunning <[email protected]> wrote:
> Wrong in the sense of clustering is hard to define.  Certainly a wide range
> of cluster sizes looks dubious, but not definitive.
>
> Next easy steps include cosine normalizing the vectors and doing
> semi-supervised clustering.  Clustering the 50d data in R might also be
> useful.  Normalizing is a single method call in the normal flow.  It can be
> done on the projected vectors without loss of generality.  After cosine
> normalization, semi-supervised clustering can be done by adding an
> additional 20 dimensions with a 1 of n encoding of the correct newsgroup.
> In the test data, these can be set to all zeros.  This gives the
> clustering algorithm a strong hint about what you are thinking about.
>
> It is also worth checking the sum of squared distances to make sure it is
> relatively small.
>
> On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon 
> <[email protected]> wrote:
>
>> They're both wrong! :(
>>
