Hmm... I will have to take a look.
Is your CSV file on EC2 as before?

On Thu, Nov 29, 2012 at 1:26 PM, Dan Filimon <[email protected]> wrote:

> Ted, I'm having issues with clustering the data with R. It wants to
> convert it into a dense matrix for clustering apparently.
>
> > kmeans(M, 20, iter.max=20)
> Error in asMethod(object) : cannot allocate vector of length 1705007196
>
> There's an as.matrix(...) call that's responsible.
> There's the biganalytics package [1] which supports file-backed
> matrices, but if I attempt to make my sparse matrix a big.matrix,
> it'll still fail:
>
> > big.matrix(M)
> Error in nrow < 1 : cannot allocate vector of length 1705007196
>
> So, I think there's no way I can read it as a sparse Market Matrix and
> run k-means on it. On the other hand, if I want to use bigkmeans
> provided by biganalytics, but that doesn't work directly either
>
> > bigkmeans(M, 20, iter.max=20)
> Error in duplicated.default(centers[[length(centers)]]) :
>   duplicated() applies only to vectors
>
> So, it seems that I have to read in a big.matrix, from disk, but that
> would mean building a dense CSV file like I tried earlier. That would
> be over 12GB in size though...
> Any other ideas?
>
> [1] http://cran.r-project.org/web/packages/biganalytics/biganalytics.pdf
>
> On Tue, Nov 27, 2012 at 9:46 PM, Dan Filimon
> <[email protected]> wrote:
> > Running k-means in R with the projected 50-dimensional vectors gets me
> > the following sizes for the 20 clusters:
> >
> > K-means clustering with 20 clusters of sizes 140, 1195, 228, 3081,
> > 2162, 462, 31, 329, 14, 936, 2602, 32, 32, 587, 105, 1662, 2124, 66,
> > 78, 2962
> >
> > I guess projecting them might be the issue... (this is for 50 iterations).
> >
> > On Tue, Nov 27, 2012 at 4:29 PM, Ted Dunning <[email protected]> wrote:
> >> Wrong in the sense of clustering is hard to define. Certainly a wide range
> >> of cluster sizes looks dubious, but not definitive.
> >>
> >> Next easy steps include cosine normalizing the vectors and doing
> >> semi-supervised clustering. Clustering the 50d data in R might also be
> >> useful. Normalizing is a single method call in the normal flow. It can be
> >> done on the projected vectors without loss of generality. After cosine
> >> normalization, semi-supervised clustering can be done by adding an
> >> additional 20 dimensions with a 1 of n encoding of the correct newsgroup.
> >> In the test data, these can be set to all zeros. This gives the
> >> clustering algorithm a strong hint about what you are thinking about.
> >>
> >> It is also worth checking the sum of squared distances to make sure it is
> >> relatively small.
> >>
> >> On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon <[email protected]> wrote:
> >>
> >>> They're both wrong! :(
> >>>
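
For reference, here is a minimal sketch of the preprocessing Ted suggests above: cosine normalization, a 1-of-n label block appended to each training vector (all zeros for test vectors), and the sum-of-squared-distance check. It is plain Python rather than the R or Mahout code used in the thread. The function names, the integer `label` index, and the `weight` parameter are assumptions made for illustration and do not come from any library discussed here.

```python
import math


def cosine_normalize(v):
    """Scale v to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def augment(v, label=None, n_classes=20, weight=1.0):
    """Normalize v, then append an n_classes-wide 1-of-n block.

    label is a hypothetical integer class index (0 .. n_classes - 1) for
    training vectors; pass None for test vectors so the block stays all zeros.
    weight (an assumption, not from the thread) controls how strong the hint is.
    """
    extra = [0.0] * n_classes
    if label is not None:
        extra[label] = weight
    return cosine_normalize(v) + extra


def sum_squared_distance(points, centroids, assignment):
    """Total squared Euclidean distance from each point to its assigned centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroids[k]))
        for point, k in zip(points, assignment)
    )


# Usage with a toy 2-d vector and 3 classes instead of 50 dimensions and 20 newsgroups.
train_vec = augment([3.0, 4.0], label=2, n_classes=3)
test_vec = augment([3.0, 4.0], n_classes=3)
```

The label block is appended after normalization, so with `weight=1.0` a labeled vector carries as much mass in the hint dimensions as in the original 50. Lowering `weight` would weaken the hint.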
