On Jul 14, 2011, at 3:24 PM, Sean Owen wrote:

> On Thu, Jul 14, 2011 at 8:00 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> 
>>> You need all cooccurrences since some implementations need that value,
>>> and you're computing all-pairs.
>> 
>> Can you explain the diffs from the cited paper?  (Per the comment in the
>> top of the Job file)
>> 
>> For the record, I'm currently running this on ~500K rows and ~150K terms
>> (each vector is pretty sparse) and it is taking a long time, way longer than
>> what is cited in the paper for what appears to be a bigger corpus with more
>> terms on crappier hardware.
>> 
>> 
> It's the same approach in the important aspects, as far as I can tell. One
> thing they did is remove the top 1% longest posting lists entirely, which
> is exactly the long-rows (here, 'columns') issue Ted mentioned.
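
For concreteness, I read that as dropping the top 1% longest columns up front,
before the pairwise pass. Roughly something like this (a plain-Java sketch with
made-up types and method names, not the paper's or Mahout's actual code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ColumnPruning {

  /**
   * Returns the term ids whose posting lists are in the top 1% by length
   * (i.e., the highest-document-frequency terms). A cooccurrence pass
   * would then skip these columns entirely.
   */
  static Set<Integer> topOnePercentByLength(Map<Integer, Integer> postingListLengths) {
    List<Map.Entry<Integer, Integer>> entries =
        new ArrayList<>(postingListLengths.entrySet());
    // Sort descending by posting-list length (document frequency).
    entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
    int cutoff = entries.size() / 100;  // top 1%
    Set<Integer> pruned = new HashSet<>();
    for (int i = 0; i < cutoff; i++) {
      pruned.add(entries.get(i).getKey());
    }
    return pruned;
  }
}

In other words: compute the posting-list lengths once, find the 99th-percentile
cutoff, and never emit cooccurrences for the columns above it.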

Our vectors were generated with seq2sparse using a minDF of 2 and a max DF
percentage of 90%, so I believe that is also more aggressive than the paper
and is what results in the 150K terms.

I'll see if we can run some more tests and get more insight into which
phases are taking so long.

> 
> There is not a lever here that helps truncate these big rows/columns, which
> can dominate the run time -- that could be useful. It just needs a rule for
> tossing data -- you could simply throw away such columns (ouch), or at least
> use only a sampled subset of them.
> 
> That's my guess.
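
Sampling does sound less drastic than tossing the column outright. Something
like this is what I'd picture, i.e. keep at most N of a long column's non-zero
entries via reservoir sampling (again just a schematic plain-Java sketch, not
existing Mahout code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ColumnSampling {

  /**
   * Down-samples a column to at most maxEntries non-zero row ids using
   * reservoir sampling, so one over-long column can't dominate the
   * pairwise pass. Values would be carried along the same way.
   */
  static List<Integer> sampleNonZeroRows(Iterable<Integer> rowIds, int maxEntries, Random random) {
    List<Integer> reservoir = new ArrayList<>(maxEntries);
    int seen = 0;
    for (Integer rowId : rowIds) {
      seen++;
      if (reservoir.size() < maxEntries) {
        reservoir.add(rowId);
      } else {
        // Each of the 'seen' entries survives with probability maxEntries/seen.
        int slot = random.nextInt(seen);
        if (slot < maxEntries) {
          reservoir.set(slot, rowId);
        }
      }
    }
    return reservoir;
  }
}

That would keep the cost of any one column bounded without throwing the term
away completely.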

--------------------------
Grant Ingersoll


