On Thu, Jul 14, 2011 at 8:00 PM, Grant Ingersoll <[email protected]> wrote:
>> You need all cooccurrences since some implementations need that value, and
>> you're computing all-pairs.
>
> Can you explain the diffs from the cited paper? (Per the comment in the
> top of the Job file)
>
> For the record, I'm currently running this on ~500K rows and ~150K terms
> (each vector is pretty sparse) and it is taking a long time, way longer
> than what is cited in the paper for what appears to be a bigger corpus
> with more terms on crappier hardware.

It's the same approach in the important aspects, as far as I can tell. One
thing they did was remove the top 1% longest posting lists entirely, which is
exactly the long-rows (here, 'columns') issue Ted mentioned. There is no
lever here to truncate these big rows/columns, which can dominate the run
time -- adding one could be useful. It just needs a rule for tossing data:
you could simply throw away such columns (ouch), or at least use only a
sampled subset of each. That's my guess.
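
For concreteness, here is a rough sketch of the "sampled subset" idea. This
is not code from the Job; the cap and the column representation are invented
for illustration, and a real cap would presumably come from something like a
length percentile over all columns:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only -- not actual Job code. If a column's posting list
// exceeds some cap, keep only a uniform random sample of its entries so
// that one very frequent term can't dominate the pairwise computation.
public final class ColumnDownsampler {

  private final int maxEntriesPerColumn; // hypothetical cap
  private final Random random = new Random();

  public ColumnDownsampler(int maxEntriesPerColumn) {
    this.maxEntriesPerColumn = maxEntriesPerColumn;
  }

  // Returns the column unchanged if it is short enough, otherwise a
  // uniform random sample of maxEntriesPerColumn of its entries.
  public <E> List<E> downsample(List<E> columnEntries) {
    if (columnEntries.size() <= maxEntriesPerColumn) {
      return columnEntries;
    }
    List<E> shuffled = new ArrayList<E>(columnEntries);
    Collections.shuffle(shuffled, random);
    return new ArrayList<E>(shuffled.subList(0, maxEntriesPerColumn));
  }
}

Dropping the long columns outright is simpler but throws away all of that
data; sampling keeps some signal from the frequent terms while capping what
they cost.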
