"Wrong" is hard to define for clustering.  A wide range of cluster sizes
certainly looks dubious, but it is not definitive.

Next easy steps include cosine normalizing the vectors and doing
semi-supervised clustering.  Clustering the 50d data in R might also be
useful.  Normalizing is a single method call in the normal flow.  It can be
done on the projected vectors without loss of generality.  After cosine
normalization, semi-supervised clustering can be done by adding an
additional 20 dimensions with a 1-of-n encoding of the correct newsgroup.
In the test data, these can be set to all zeros.  This gives the
clustering algorithm a strong hint about what you are thinking about.
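
A rough sketch of the normalize-then-augment step, in plain Python rather
than the actual pipeline code.  The function names and the toy 2d vector are
made up for illustration; the real vectors would be the 50d projections, and
the real normalization would be whatever single call the normal flow uses.

```python
import math

def cosine_normalize(v):
    """Scale v to unit Euclidean length; an all-zero vector is returned as is."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def add_label_hint(v, label=None, n_labels=20):
    """Normalize v, then append n_labels extra dimensions.

    Training rows get a 1-of-n encoding of the known newsgroup index.
    Test rows (label=None) get all zeros in those dimensions.
    """
    hint = [0.0] * n_labels
    if label is not None:
        hint[label] = 1.0
    return cosine_normalize(v) + hint

# Toy example: a 2d vector stands in for a 50d projected vector.
train_row = add_label_hint([3.0, 4.0], label=2)   # newsgroup index 2 is known
test_row = add_label_hint([3.0, 4.0])             # label unknown, hint is zeros
```

Normalizing before appending keeps the hint dimensions on a fixed scale
relative to the unit-length data part of every row.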

It is also worth checking the sum of squared distances to make sure it is
relatively small.
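
For concreteness, here is a minimal sketch of that check in plain Python.
It assumes you already have each point's assigned cluster index; the names
and the toy data are hypothetical, not taken from the actual code.

```python
def sum_squared_distance(points, centroids, assignments):
    """Total squared Euclidean distance from each point to its assigned centroid."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        centroid = centroids[cluster]
        total += sum((a - b) ** 2 for a, b in zip(point, centroid))
    return total

# Toy data: two points around the first centroid, one exactly on the second.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
centroids = [[0.0, 1.0], [10.0, 10.0]]
assignments = [0, 0, 1]

ssd = sum_squared_distance(points, centroids, assignments)
```

"Relatively small" needs a baseline.  One possible yardstick is the same sum
computed with every point assigned to a single centroid at the global mean.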

On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon <[email protected]> wrote:

> They're both wrong! :(
>
