On Thu, Jul 14, 2011 at 8:00 PM, Grant Ingersoll <[email protected]> wrote:
>> You need all cooccurrences since some implementations need that value, and
>> you're computing all-pairs.
>
> Can you explain the diffs from the cited paper? (Per the comment in the
> top of the Job file)
>
> For the record, I'm currently running this on ~500K rows and ~150K terms
> (each vector is pretty sparse) and it is taking a long time, way longer
> than what is cited in the paper for what appears to be a bigger corpus
> with more terms on crappier hardware.

It's the same approach in the important aspects, as far as I can tell. One
thing they did was remove the top 1% longest posting lists entirely, which is
exactly the long-rows (here, 'columns') issue Ted mentioned. There is no
lever here to truncate these big rows/columns, which can dominate the run
time -- adding one could be useful. It just needs a rule for tossing data:
you could simply throw away such columns (ouch), or at least use only a
sampled subset of each. That's my guess.
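
For concreteness, here is a rough sketch of the "sampled subset" idea. This
is not code from the Job; the cap and the column representation are invented
for illustration, and a real cap would presumably come from something like a
length percentile over all columns:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only -- not actual Job code. If a column's posting list
// exceeds some cap, keep only a uniform random sample of its entries so
// that one very frequent term can't dominate the pairwise computation.
public final class ColumnDownsampler {

  private final int maxEntriesPerColumn; // hypothetical cap
  private final Random random = new Random();

  public ColumnDownsampler(int maxEntriesPerColumn) {
    this.maxEntriesPerColumn = maxEntriesPerColumn;
  }

  // Returns the column unchanged if it is short enough, otherwise a
  // uniform random sample of maxEntriesPerColumn of its entries.
  public <E> List<E> downsample(List<E> columnEntries) {
    if (columnEntries.size() <= maxEntriesPerColumn) {
      return columnEntries;
    }
    List<E> shuffled = new ArrayList<E>(columnEntries);
    Collections.shuffle(shuffled, random);
    return new ArrayList<E>(shuffled.subList(0, maxEntriesPerColumn));
  }
}

Dropping the long columns outright is simpler but throws away all of that
data; sampling keeps some signal from the frequent terms while capping what
they cost.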
