On Thu, Feb 25, 2010 at 2:09 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> On Thu, Feb 25, 2010 at 1:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> After we delete hapax, we may have considerably fewer tokens. But the LLR
>> step that Robin implied may have already dealt with that.
>>
>
> The more I think about it, the more I like the idea of doing the LLR-filtered
> 5-gram extraction of 5 million wikipedia pages, then taking the SVD of the
> transpose of this matrix, and we've got the "record breaking" scale on a
> non-synthetic data set which should work on the current code (without the
> stochastic approximation).

Blarg. Our current Vectorizer code for text builds a nice term -> termId
dictionary, but the keys on the rows of the matrix are left as Text: the
matrix created is backed by a SequenceFile<Text,VectorWritable>, which is fine
for SVD, but not so fine for doing a transpose, because we don't have a
document -> documentId dictionary.

I've written the M/R job in DistributedRowMatrix to do the transpose, but the
document matrices produced by SparseVectorsFromSequenceFiles don't have
integer-valued keys for their rows, so transposing them doesn't yet make
sense. Fooey. More work to do.

  -jake
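
For concreteness, here is a minimal, untested sketch of what a re-keying job
could look like: a single-reducer Hadoop job that rewrites a
SequenceFile<Text,VectorWritable> as SequenceFile<IntWritable,VectorWritable>,
handing out sequential integer row ids. The class and job names below are just
placeholders, not anything that exists in Mahout today.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public class RowIdAssigner {

  // Single reducer sees every row, so it can hand out globally unique,
  // sequential ids as it writes the rows back out.
  public static class AssignIdReducer
      extends Reducer<Text, VectorWritable, IntWritable, VectorWritable> {
    private int nextId = 0;
    private final IntWritable id = new IntWritable();

    @Override
    protected void reduce(Text docKey, Iterable<VectorWritable> rows, Context ctx)
        throws IOException, InterruptedException {
      for (VectorWritable row : rows) {
        id.set(nextId++);
        ctx.write(id, row);
        // a real job would also emit docKey -> id to a side dictionary file,
        // mirroring the term -> termId dictionary
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "assign integer row ids");
    job.setJarByClass(RowIdAssigner.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // default (identity) mapper passes (Text, VectorWritable) straight through
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(VectorWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);
    job.setReducerClass(AssignIdReducer.class);
    job.setNumReduceTasks(1);  // one reducer keeps the ids globally unique
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The single reducer obviously serializes the whole pass, which is fine as a
sketch but not how you'd want to do it at 5M-document scale; the point is just
that once the rows carry IntWritable keys (and the Text -> id mapping is kept
as a dictionary), the DistributedRowMatrix transpose job has what it needs.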