No, I abandoned creating the dense CSV file on my laptop since I ran out of space. I'll try running biganalytics on my desktop, where I have more room. I uploaded the vectors-initial.mm file with the sparse matrix.
On Fri, Nov 30, 2012 at 12:46 AM, Ted Dunning <[email protected]> wrote:
> Hmm...
>
> I will have to take a look.
>
> Is your CSV file on EC2 as before?
>
> On Thu, Nov 29, 2012 at 1:26 PM, Dan Filimon
> <[email protected]> wrote:
>
>> Ted, I'm having issues with clustering the data with R. It apparently
>> wants to convert it into a dense matrix for clustering.
>>
>> > kmeans(M, 20, iter.max=20)
>> Error in asMethod(object) : cannot allocate vector of length 1705007196
>>
>> There's an as.matrix(...) call that's responsible.
>> There's the biganalytics package [1], which supports file-backed
>> matrices, but if I attempt to make my sparse matrix a big.matrix,
>> it'll still fail:
>>
>> > big.matrix(M)
>> Error in nrow < 1 : cannot allocate vector of length 1705007196
>>
>> So, I think there's no way I can read it as a sparse Matrix Market
>> file and run k-means on it. On the other hand, I could use the
>> bigkmeans function from biganalytics, but that doesn't work directly
>> either:
>>
>> > bigkmeans(M, 20, iter.max=20)
>> Error in duplicated.default(centers[[length(centers)]]) :
>>   duplicated() applies only to vectors
>>
>> So, it seems that I have to read in a big.matrix from disk, but that
>> would mean building a dense CSV file like I tried earlier. That would
>> be over 12 GB in size, though...
>> Any other ideas?
>>
>> [1] http://cran.r-project.org/web/packages/biganalytics/biganalytics.pdf
>>
>> On Tue, Nov 27, 2012 at 9:46 PM, Dan Filimon
>> <[email protected]> wrote:
>> > Running k-means in R with the projected 50-dimensional vectors gets me
>> > the following sizes for the 20 clusters:
>> >
>> > K-means clustering with 20 clusters of sizes 140, 1195, 228, 3081,
>> > 2162, 462, 31, 329, 14, 936, 2602, 32, 32, 587, 105, 1662, 2124, 66,
>> > 78, 2962
>> >
>> > I guess projecting them might be the issue... (this is for 50
>> > iterations).
>> >
>> > On Tue, Nov 27, 2012 at 4:29 PM, Ted Dunning <[email protected]> wrote:
>> >> "Wrong", in the sense of clustering, is hard to define. Certainly a
>> >> wide range of cluster sizes looks dubious, but it's not definitive.
>> >>
>> >> Next easy steps include cosine-normalizing the vectors and doing
>> >> semi-supervised clustering. Clustering the 50-d data in R might also
>> >> be useful. Normalizing is a single method call in the normal flow. It
>> >> can be done on the projected vectors without loss of generality. After
>> >> cosine normalization, semi-supervised clustering can be done by adding
>> >> an additional 20 dimensions with a 1-of-n encoding of the correct
>> >> newsgroup. In the test data, these can be set to all zeros. This gives
>> >> the clustering algorithm a strong hint about what you are thinking
>> >> about.
>> >>
>> >> It is also worth checking the sum of squared distances to make sure
>> >> it is relatively small.
>> >>
>> >> On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon
>> >> <[email protected]> wrote:
>> >>
>> >>> They're both wrong! :(
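
For reference, a rough sketch of the file-backed route discussed above: stream the sparse Matrix Market data to a dense CSV in row chunks (so only one chunk is ever dense in memory), attach it as a file-backed big.matrix, and cluster that with bigkmeans. This is untested, and the chunk size plus the vectors-dense.csv / vectors.bin / vectors.desc file names are assumptions:

  library(Matrix)        # readMM
  library(bigmemory)     # read.big.matrix
  library(biganalytics)  # bigkmeans

  M <- readMM("vectors-initial.mm")   # sparse matrix, stays sparse here

  # Write dense rows chunk by chunk; only one chunk is densified at a time.
  chunk <- 10000
  for (i in seq(1, nrow(M), by = chunk)) {
    rows <- i:min(i + chunk - 1, nrow(M))
    write.table(as.matrix(M[rows, , drop = FALSE]), "vectors-dense.csv",
                sep = ",", append = (i > 1),
                row.names = FALSE, col.names = FALSE)
  }

  # File-backed matrix: the data lives on disk, not in RAM.
  bm <- read.big.matrix("vectors-dense.csv", sep = ",", type = "double",
                        backingfile = "vectors.bin",
                        descriptorfile = "vectors.desc")

  # bigkmeans expects a big.matrix (or plain matrix), which is why
  # passing the sparse M directly failed above.
  clusters <- bigkmeans(bm, 20, iter.max = 20)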
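
And a sketch of Ted's cosine-normalization plus semi-supervised idea in R terms, assuming X is the n x 50 matrix of projected vectors and labels is a factor of newsgroup names with NA for test rows; both names are hypothetical:

  # Cosine normalization: scale each row to unit L2 norm.
  norms <- sqrt(rowSums(X^2))
  norms[norms == 0] <- 1            # guard against all-zero rows
  Xn <- X / norms                   # divides row i by norms[i]

  # Append 20 extra columns with a 1-of-n encoding of the newsgroup;
  # rows without a known label (the test data) stay all zeros.
  hint <- matrix(0, nrow(Xn), nlevels(labels))
  known <- !is.na(labels)
  hint[cbind(which(known), as.integer(labels[known]))] <- 1
  Xs <- cbind(Xn, hint)

  km <- kmeans(Xs, 20, iter.max = 50)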
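
The sum-of-squared-distances check Ted mentions is already part of what base R's kmeans() returns, e.g. on the km object from the sketch above:

  km$tot.withinss              # total squared distance to assigned centers
  km$tot.withinss / km$totss   # fraction of total variance left within clusters
  table(km$cluster)            # cluster sizes, as quoted earlier in the thread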
