On Thu, Feb 25, 2010 at 2:09 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> On Thu, Feb 25, 2010 at 1:48 PM, Ted Dunning <ted.dunn...@gmail.com>wrote:
>
>> After we delete hapax, we may have considerably fewer tokens.  But the LLR
>> step that Robin implied may have already dealt with that.
>>
>
> The more I think about it, the better this sounds: do the LLR-filtered
> 5-gram extraction over the 5 million wikipedia pages, then SVD the
> transpose of that matrix, and we've got "record breaking" scale on a
> non-synthetic data set, which should work with the current code (without
> the stochastic approximation).
>
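
For reference, the LLR filtering in that step boils down to Dunning's G^2
test on a 2x2 count table for each candidate n-gram versus its parts. Here
is a minimal standalone sketch of the scoring function (my own
re-derivation, not the code in Mahout's collocation job; the class and
variable names are mine):

  // Standalone sketch of Dunning's log-likelihood ratio (G^2) for a 2x2
  // contingency table of counts.  k11 = count of the n-gram itself,
  // k12/k21 = counts of each side occurring without the other,
  // k22 = everything else.
  public final class LlrSketch {

    // Shannon entropy (in nats) of a set of counts, normalized to probabilities.
    private static double entropy(long... counts) {
      long total = 0;
      for (long c : counts) {
        total += c;
      }
      double h = 0.0;
      for (long c : counts) {
        if (c > 0) {
          double p = (double) c / total;
          h -= p * Math.log(p);
        }
      }
      return h;
    }

    // LLR = 2 * N * (H(rowSums) + H(colSums) - H(cells)); a larger score means
    // the co-occurrence is more surprising than chance.
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      long n = k11 + k12 + k21 + k22;
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double colEntropy = entropy(k11 + k21, k12 + k22);
      double cellEntropy = entropy(k11, k12, k21, k22);
      return 2.0 * n * (rowEntropy + colEntropy - cellEntropy);
    }
  }

The filtering then just keeps the n-grams whose score clears whatever
minimum-LLR threshold we pick.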

Blarg. Our current Vectorizer code for text builds a nice term -> termId
dictionary, but the keys on the rows of the matrix are left as Text: the
matrix created is backed by a SequenceFile<Text,VectorWritable>, which is
fine for SVD, but not so fine for doing a transpose, because we don't have
a document -> documentId dictionary.
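
In the meantime, a single sequential re-keying pass would get us integer row
keys plus a docId -> document dictionary. Rough sketch only (the class name,
the paths, and the choice to do it sequentially rather than as an M/R job
are all mine, not anything in the current code):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.mahout.math.VectorWritable;

  // Hypothetical sketch: re-key a <Text,VectorWritable> document matrix with
  // sequential integer row ids, and write a docId -> document-name dictionary
  // alongside it so the mapping can be inverted later.
  public class RekeyDocumentMatrix {
    public static void rekey(Configuration conf, Path in, Path matrixOut, Path dictOut)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
      SequenceFile.Writer matrixWriter =
          SequenceFile.createWriter(fs, conf, matrixOut, IntWritable.class, VectorWritable.class);
      SequenceFile.Writer dictWriter =
          SequenceFile.createWriter(fs, conf, dictOut, IntWritable.class, Text.class);
      Text docName = new Text();
      VectorWritable row = new VectorWritable();
      IntWritable docId = new IntWritable();
      int nextId = 0;
      while (reader.next(docName, row)) {
        docId.set(nextId++);
        matrixWriter.append(docId, row);      // integer-keyed row for transpose/SVD
        dictWriter.append(docId, docName);    // docId -> original Text key
      }
      reader.close();
      matrixWriter.close();
      dictWriter.close();
    }
  }

Doing it in one sequential pass sidesteps the problem of handing out
globally unique ids from parallel mappers; for 5 million rows that is
probably tolerable.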

I've written the M/R job in DistributedRowMatrix to do the transpose, but
the document matrices produced by SparseVectorsFromSequenceFiles don't have
integer-valued keys for their rows, so the transpose doesn't yet make sense.
Fooey.  More work to do.
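
Once the rows are integer-keyed, the transpose job should be directly
usable. Roughly the shape of the call I have in mind (the
DistributedRowMatrix constructor, setConf, and the dimension values below
are my guesses at the API shape, not verified signatures):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;

  // Usage sketch only: constructor arguments, setConf, paths, and dimensions
  // are placeholders/assumptions, not checked against the actual class.
  public class TransposeSketch {
    public static void main(String[] args) throws Exception {
      int numDocs = 5000000;   // the wikipedia pages (rows)
      int numCols = 1000000;   // size of the 5-gram dictionary (placeholder)
      Configuration conf = new Configuration();
      // wrap the re-keyed SequenceFile<IntWritable,VectorWritable> as a row matrix
      DistributedRowMatrix docByTerm = new DistributedRowMatrix(
          new Path("/vectors/rekeyed"), new Path("/tmp/drm-work"), numDocs, numCols);
      docByTerm.setConf(conf);
      // run the M/R transpose; the result is the term-by-document matrix to feed the SVD
      DistributedRowMatrix termByDoc = docByTerm.transpose();
    }
  }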

  -jake
