On Thu, Feb 25, 2010 at 1:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> After we delete hapax, we may have considerably fewer tokens.  But the LLR
> step that Robin implied may have already dealt with that.
>

The more I think about it, the more I like the idea of doing the
LLR-filtered 5-gram extraction over 5 million Wikipedia pages and then
running SVD on the transpose of that matrix: that gets us the "record
breaking" scale on a non-synthetic data set, and it should work with the
current code (without the stochastic approximation).

Working on the transpose (with the 5M documents as columns) means we're
getting the *left* singular vectors instead of the usual *right*
singular vectors, but that's actually nice: you can immediately run
clustering on the output and you're clustering *documents*, with no
second pass over the corpus to produce document vectors from the right
singular vectors.
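
To spell out why, since the Lanczos solver effectively finds the
eigenvectors of M'M for whichever row matrix M you hand it (A below is
the usual docs-as-rows, 5-grams-as-columns matrix):

  A    = U Sigma V'    (A: 5M docs x N 5-gram columns)
  A'A  = V Sigma^2 V'  -> Lanczos on A gives the right singular vectors
                          V, indexed by *term*
  AA'  = U Sigma^2 U'  -> Lanczos on A' (the transpose) gives the left
                          singular vectors U, indexed by *document*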

5M is even small enough that we won't need the MAHOUT-310 memory/disk
tradeoff enhancement, because 300 vectors * 8 bytes * 5M docs = 12GB,
and my desktop has 24GB now.  :)
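
Back of the envelope, ignoring per-object overhead (just raw doubles):

  // 300 dense singular vectors over 5M docs, stored as doubles
  public final class MemoryCheck {
    public static void main(String[] args) {
      long numDocs = 5000000L;
      int rank = 300;
      long bytes = numDocs * rank * 8L;             // 12,000,000,000 bytes
      System.out.printf("%.1f GB%n", bytes / 1e9);  // ~12.0 GB, under 24GB
    }
  }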

So the first step: extract the sparse vectors for 5-grams with LLR > 1
from Wikipedia, then take the transpose of that matrix.  We don't have
the MR code for that last step checked in yet, but it's a critically
useful enhancement to DistributedSparseRowMatrix which we need anyway.
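For concreteness, here's roughly the shape of that transpose job --
just a sketch, not checked-in code, and the config key and class names
below are made up:

  // Sketch of a row-matrix transpose as one map-reduce pass.  For every
  // nonzero (column j, value v) in input row i, emit (j, a single-entry
  // vector with v at position i); the reducer for j then merges those
  // single entries into row j of the transposed matrix.
  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  public final class TransposeSketch {

    public static class TransposeMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
      private int numRows;  // rows of the input = columns of the transpose

      @Override
      protected void setup(Context ctx) {
        numRows = ctx.getConfiguration().getInt("transpose.numRows", -1);
      }

      @Override
      protected void map(IntWritable row, VectorWritable vw, Context ctx)
          throws IOException, InterruptedException {
        Iterator<Vector.Element> it = vw.get().iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          Vector partial = new RandomAccessSparseVector(numRows);
          partial.setQuick(row.get(), e.get());
          ctx.write(new IntWritable(e.index()), new VectorWritable(partial));
        }
      }
    }

    public static class TransposeReducer
        extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
      private int numRows;

      @Override
      protected void setup(Context ctx) {
        numRows = ctx.getConfiguration().getInt("transpose.numRows", -1);
      }

      @Override
      protected void reduce(IntWritable col, Iterable<VectorWritable> partials,
          Context ctx) throws IOException, InterruptedException {
        // each partial carries one nonzero of this output row
        Vector merged = new RandomAccessSparseVector(numRows);
        for (VectorWritable vw : partials) {
          Iterator<Vector.Element> it = vw.get().iterateNonZero();
          while (it.hasNext()) {
            Vector.Element e = it.next();
            merged.setQuick(e.index(), e.get());
          }
        }
        ctx.write(col, new VectorWritable(merged));
      }
    }
  }

A combiner that pre-merges the partial vectors per column would cut the
shuffle size, but that's a detail.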

  -jake



>
> On Thu, Feb 25, 2010 at 1:43 PM, Jake Mannix <jake.man...@gmail.com>
> wrote:
>
> > Of course, at this point we've
> > got
> > too many terms to properly do the decomposition directly on the input
> > matrix,
> > we'd have to do it on the transpose,
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
