"Wrong" in the clustering sense is hard to define. A wide range of cluster sizes certainly looks dubious, but it is not definitive.
Easy next steps include cosine normalizing the vectors and doing semi-supervised clustering. Clustering the 50-dimensional data in R might also be useful.

Normalizing is a single method call in the normal flow. It can be done on the projected vectors without loss of generality.

After cosine normalization, semi-supervised clustering can be done by adding an additional 20 dimensions with a 1-of-n encoding of the correct newsgroup. In the test data, these can be set to all zeros. This gives the clustering algorithm a strong hint about what you are thinking about.

It is also worth checking the sum of squared distances to make sure it is relatively small.

On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon <[email protected]> wrote:

> They're both wrong! :(
>
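
For concreteness, here is a rough sketch of the normalize-then-hint idea and the squared-distance check, in plain Python rather than Mahout. The function names are made up for illustration. It assumes newsgroup labels are integers 0..19 and that the sum of squared distances is measured from each point to its assigned centroid.

```python
import math

NUM_GROUPS = 20  # one extra dimension per newsgroup


def cosine_normalize(v):
    """Scale v to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def add_label_hint(v, label=None, num_groups=NUM_GROUPS):
    """Normalize v, then append a 1-of-n block for the known newsgroup.

    Training rows pass an integer label in 0..num_groups-1.
    Test rows pass label=None, which leaves the extra block all zeros.
    """
    hint = [0.0] * num_groups
    if label is not None:
        hint[label] = 1.0
    return cosine_normalize(v) + hint


def sum_squared_distance(points, centroids, assignments):
    """Total squared Euclidean distance from each point to its assigned centroid."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        center = centroids[cluster]
        total += sum((a - b) ** 2 for a, b in zip(point, center))
    return total


# Tiny illustration with a 2-d "projected" vector instead of 50-d.
train_row = add_label_hint([3.0, 4.0], label=2)  # hint slot 2 is set
test_row = add_label_hint([3.0, 4.0])            # hint block stays zero
```

The hint dimensions sit alongside unit-length vectors, so a single 1.0 is a fairly strong pull toward grouping same-label documents together, which is the point of the exercise.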
