No, I abandoned creating the dense CSV file on my laptop since I ran out of space. I'll try running biganalytics on my desktop, where I have more room. I uploaded the vectors-initial.mm file with the sparse matrix.
On Fri, Nov 30, 2012 at 12:46 AM, Ted Dunning <[email protected]> wrote:
> Hmm...
>
> I will have to take a look.
>
> Is your CSV file on EC2 as before?
>
> On Thu, Nov 29, 2012 at 1:26 PM, Dan Filimon
> <[email protected]> wrote:
>
>> Ted, I'm having issues with clustering the data with R. It apparently
>> wants to convert it into a dense matrix for clustering.
>>
>> > kmeans(M, 20, iter.max=20)
>> Error in asMethod(object) : cannot allocate vector of length 1705007196
>>
>> There's an as.matrix(...) call that's responsible.
>> There's the biganalytics package [1], which supports file-backed
>> matrices, but if I attempt to make my sparse matrix a big.matrix,
>> it'll still fail:
>>
>> > big.matrix(M)
>> Error in nrow < 1 : cannot allocate vector of length 1705007196
>>
>> So, I think there's no way I can read it as a sparse Matrix Market
>> file and run k-means on it. On the other hand, I could use the
>> bigkmeans function from biganalytics, but that doesn't work directly
>> either:
>>
>> > bigkmeans(M, 20, iter.max=20)
>> Error in duplicated.default(centers[[length(centers)]]) :
>>   duplicated() applies only to vectors
>>
>> So, it seems that I have to read in a big.matrix from disk, but that
>> would mean building a dense CSV file like I tried earlier. That would
>> be over 12 GB in size, though...
>> Any other ideas?
>>
>> [1] http://cran.r-project.org/web/packages/biganalytics/biganalytics.pdf
>>
>> On Tue, Nov 27, 2012 at 9:46 PM, Dan Filimon
>> <[email protected]> wrote:
>> > Running k-means in R with the projected 50-dimensional vectors gets me
>> > the following sizes for the 20 clusters:
>> >
>> > K-means clustering with 20 clusters of sizes 140, 1195, 228, 3081,
>> > 2162, 462, 31, 329, 14, 936, 2602, 32, 32, 587, 105, 1662, 2124, 66,
>> > 78, 2962
>> >
>> > I guess projecting them might be the issue... (this is for 50
>> > iterations).
>> >
>> > On Tue, Nov 27, 2012 at 4:29 PM, Ted Dunning <[email protected]> wrote:
>> >> "Wrong", in the sense of clustering, is hard to define. Certainly a
>> >> wide range of cluster sizes looks dubious, but it's not definitive.
>> >>
>> >> Next easy steps include cosine-normalizing the vectors and doing
>> >> semi-supervised clustering. Clustering the 50-d data in R might also
>> >> be useful. Normalizing is a single method call in the normal flow. It
>> >> can be done on the projected vectors without loss of generality. After
>> >> cosine normalization, semi-supervised clustering can be done by adding
>> >> an additional 20 dimensions with a 1-of-n encoding of the correct
>> >> newsgroup. In the test data, these can be set to all zeros. This gives
>> >> the clustering algorithm a strong hint about what you are thinking
>> >> about.
>> >>
>> >> It is also worth checking the sum of squared distances to make sure
>> >> it is relatively small.
>> >>
>> >> On Tue, Nov 27, 2012 at 5:42 AM, Dan Filimon
>> >> <[email protected]> wrote:
>> >>
>> >>> They're both wrong! :(
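
For reference, a rough sketch of the file-backed route discussed above: stream the sparse Matrix Market data to a dense CSV in row chunks (so only one chunk is ever dense in memory), attach it as a file-backed big.matrix, and cluster that with bigkmeans. This is untested, and the chunk size plus the vectors-dense.csv / vectors.bin / vectors.desc file names are assumptions:

  library(Matrix)        # readMM
  library(bigmemory)     # read.big.matrix
  library(biganalytics)  # bigkmeans

  M <- readMM("vectors-initial.mm")   # sparse matrix, stays sparse here

  # Write dense rows chunk by chunk; only one chunk is densified at a time.
  chunk <- 10000
  for (i in seq(1, nrow(M), by = chunk)) {
    rows <- i:min(i + chunk - 1, nrow(M))
    write.table(as.matrix(M[rows, , drop = FALSE]), "vectors-dense.csv",
                sep = ",", append = (i > 1),
                row.names = FALSE, col.names = FALSE)
  }

  # File-backed matrix: the data lives on disk, not in RAM.
  bm <- read.big.matrix("vectors-dense.csv", sep = ",", type = "double",
                        backingfile = "vectors.bin",
                        descriptorfile = "vectors.desc")

  # bigkmeans expects a big.matrix (or plain matrix), which is why
  # passing the sparse M directly failed above.
  clusters <- bigkmeans(bm, 20, iter.max = 20)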
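
And a sketch of Ted's cosine-normalization plus semi-supervised idea in R terms, assuming X is the n x 50 matrix of projected vectors and labels is a factor of newsgroup names with NA for test rows; both names are hypothetical:

  # Cosine normalization: scale each row to unit L2 norm.
  norms <- sqrt(rowSums(X^2))
  norms[norms == 0] <- 1            # guard against all-zero rows
  Xn <- X / norms                   # divides row i by norms[i]

  # Append 20 extra columns with a 1-of-n encoding of the newsgroup;
  # rows without a known label (the test data) stay all zeros.
  hint <- matrix(0, nrow(Xn), nlevels(labels))
  known <- !is.na(labels)
  hint[cbind(which(known), as.integer(labels[known]))] <- 1
  Xs <- cbind(Xn, hint)

  km <- kmeans(Xs, 20, iter.max = 50)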
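
The sum-of-squared-distances check Ted mentions is already part of what base R's kmeans() returns, e.g. on the km object from the sketch above:

  km$tot.withinss              # total squared distance to assigned centers
  km$tot.withinss / km$totss   # fraction of total variance left within clusters
  table(km$cluster)            # cluster sizes, as quoted earlier in the thread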
