Running k-means in R with the projected 50-dimensional vectors gets me the following sizes for the 20 clusters:
K-means clustering with 20 clusters of sizes 140, 1195, 228, 3081, 2162, 462, 31, 329, 14, 936, 2602, 32, 32, 587, 105, 1662, 2124, 66, 78, 2962

I guess projecting them might be the issue... (this is for 50 iterations).

On Tue, Nov 27, 2012 at 4:29 PM, Ted Dunning <[email protected]> wrote:
> Wrong in the sense of clustering is hard to define. Certainly a wide range
> of cluster sizes looks dubious, but not definitive.
>
> Next easy steps include cosine normalizing the vectors and doing
> semi-supervised clustering. Clustering the 50d data in R might also be
> useful. Normalizing is a single method call in the normal flow. It can be
> done on the projected vectors without loss of generality. After cosine
> normalization, semi-supervised clustering can be done by adding an
> additional 20 dimensions with a 1 of n encoding of the correct newsgroup.
> In the test data, these can be set to all zeros. This gives the
> clustering algorithm a strong hint about what you are thinking about.
>
> It is also worth checking the sum of squared distance to make sure it is
> relatively small.
>
> On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon
> <[email protected]> wrote:
>
>> They're both wrong! :(
>>
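
For reference, here is a rough sketch of the steps suggested in the quoted message: cosine normalization, appending a 1-of-n block for the known newsgroup (all zeros for test data), and the sum of squared distances check. It is plain Python rather than Mahout or R code, and the function names (cosine_normalize, add_label_hint, sum_squared_distance) and the toy vectors are made up for illustration only.

```python
import math

def cosine_normalize(v):
    """Scale v to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]

def add_label_hint(v, label, n_classes=20):
    """Append n_classes extra dimensions holding a 1-of-n encoding of label.

    Pass label=None for test data, which gets an all-zero block.
    """
    hint = [0.0] * n_classes
    if label is not None:
        hint[label] = 1.0
    return list(v) + hint

def sum_squared_distance(points, centroids, assignments):
    """Total squared Euclidean distance from each point to its assigned centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroids[a]))
        for point, a in zip(points, assignments)
    )

# Toy example: a 2-d vector and 3 classes instead of 50-d and 20 newsgroups.
train_vec = add_label_hint(cosine_normalize([3.0, 4.0]), label=1, n_classes=3)
test_vec = add_label_hint(cosine_normalize([3.0, 4.0]), label=None, n_classes=3)
print(train_vec)
print(test_vec)
```

The extended training vectors would then go to the clusterer in place of the plain projected ones. How strongly the extra dimensions should be weighted against the original 50 is not covered in the thread and would need experimenting.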
