On Thu, Feb 25, 2010 at 1:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> After we delete hapax, we may have considerably fewer tokens. But the LLR
> step that Robin implied may have already dealt with that.

The more I think about it, the better I feel about doing the LLR-filtered
5-gram extraction of 5 million Wikipedia pages, then taking the SVD of the
transpose of that matrix: we'd have the "record breaking" scale on a
non-synthetic data set, and it should work with the current code (without
the stochastic approximation).

5M documents means we're getting the left singular vectors instead of the
usual *right* singular vectors, but that's actually nice: you can
immediately run clustering on the output, and you're clustering *documents*
without a second pass to produce those vectors from the right singular
vectors.

5M is even small enough that we won't need the MAHOUT-310 memory/disk
tradeoff enhancement, because 300 vectors * 8 bytes * 5M docs = 12GB, and
my desktop has 24GB now. :)

So, first step: we need to extract the sparse vectors for 5-grams with
LLR > 1 from Wikipedia, then take the transpose of that matrix. We don't
have the MR code to do that last step checked in yet, but it's a critically
useful enhancement to DistributedSparseRowMatrix which we need anyway.

  -jake

> On Thu, Feb 25, 2010 at 1:43 PM, Jake Mannix <jake.man...@gmail.com>
> wrote:
>
> > Of course, at this point we've got too many terms to properly do the
> > decomposition directly on the input matrix, we'd have to do it on the
> > transpose,
>
> --
> Ted Dunning, CTO
> DeepDyve
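To spell out the left-vs-right singular vector point above, here is the
linear algebra as a quick sketch. It assumes A is the 5M-document x T-term
matrix produced by the 5-gram extraction, and that the solver returns the
right singular vectors of whatever matrix it is fed (i.e. the eigenvectors
of M^T M), which is how the thread describes the current behavior:

    A   = U \Sigma V^{\mathsf T},   A \in \mathbb{R}^{5M \times T}
    A^{\mathsf T} = V \Sigma U^{\mathsf T}

So the right singular vectors of A^T are the columns of U, the *left*
singular vectors of A. Each of the k = 300 output vectors is 5M-dimensional
(hence the 300 * 8 bytes * 5M memory estimate), and row i of
U \in \mathbb{R}^{5M \times k} is a k-dimensional representation of
document i, which is what gets clustered directly with no second pass.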
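The transpose MR job itself isn't spelled out in this thread, so below is a
rough sketch of one way it could look, assuming rows are stored the way
DistributedRowMatrix stores them (a SequenceFile of IntWritable row index
-> VectorWritable). The class name, the "transpose.num.rows" config key,
and the job wiring are illustrative only, not checked-in Mahout code:

// Sketch only: transpose a sparse row matrix by emitting each nonzero
// entry A[row][col] keyed by col, then assembling row 'col' of A^T in
// the reducer. Not the actual Mahout implementation.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TransposeSketch {  // hypothetical class name

  public static class TransposeMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private int numRows;  // row count of the input matrix (must be set in the conf)

    @Override
    protected void setup(Context ctx) {
      numRows = ctx.getConfiguration().getInt("transpose.num.rows", -1);
    }

    @Override
    protected void map(IntWritable row, VectorWritable vw, Context ctx)
        throws IOException, InterruptedException {
      Iterator<Vector.Element> it = vw.get().iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        // partial column vector: a single value at position 'row'
        Vector partial = new RandomAccessSparseVector(numRows);
        partial.setQuick(row.get(), e.get());
        ctx.write(new IntWritable(e.index()), new VectorWritable(partial));
      }
    }
  }

  public static class TransposeReducer
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private int numRows;

    @Override
    protected void setup(Context ctx) {
      numRows = ctx.getConfiguration().getInt("transpose.num.rows", -1);
    }

    @Override
    protected void reduce(IntWritable col, Iterable<VectorWritable> partials,
        Context ctx) throws IOException, InterruptedException {
      // merge all partial vectors for this column into one row of A^T
      Vector outRow = new RandomAccessSparseVector(numRows);
      for (VectorWritable vw : partials) {
        Iterator<Vector.Element> it = vw.get().iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          outRow.setQuick(e.index(), e.get());
        }
      }
      ctx.write(col, new VectorWritable(outRow));
    }
  }

  public static void main(String[] args) throws Exception {
    // usage (illustrative): TransposeSketch <input> <output> <numRowsOfInput>
    Configuration conf = new Configuration();
    conf.setInt("transpose.num.rows", Integer.parseInt(args[2]));
    Job job = new Job(conf, "matrix transpose (sketch)");
    job.setJarByClass(TransposeSketch.class);
    job.setMapperClass(TransposeMapper.class);
    job.setReducerClass(TransposeReducer.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A combiner that merges the per-column partial vectors before the shuffle
would cut network volume considerably at this scale; whatever actually gets
checked in to DistributedSparseRowMatrix may well differ from this sketch.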