On Jul 14, 2011, at 2:19 PM, Sean Owen wrote:

> I think the answer is that this is a different beast. It is a fully
> distributed computation, and doesn't have the row
> Vectors themselves together at the same time. (That would be much more
> expensive to output -- the cross product of all rows with themselves.) So
> those other measure implementations can't be applied -- or rather, there's a
> more efficient way of computing all-pairs similarity here.
> 
> You need all cooccurrences since some implementations need that value, and
> you're computing all-pairs.
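
(For illustration only: a minimal single-machine sketch of that column-wise
cooccurrence idea, with made-up class and method names, not taken from
Mahout's actual RowSimilarityJob code. Each column contributes a partial dot
product for every pair of rows that are non-zero in it, and the partials are
summed per pair, so no step ever needs two complete row vectors at once.)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceSketch {

  /** Unordered pair key "smaller,larger". */
  private static String pairKey(int a, int b) {
    return a < b ? a + "," + b : b + "," + a;
  }

  /**
   * columns.get(j) maps rowId -> that row's value in column j, i.e. the
   * matrix stored column-wise (roughly what the distributed job sees after
   * its transpose step). Returns dot products for every row pair that
   * cooccurs in at least one column.
   */
  static Map<String, Double> allPairsDotProducts(List<Map<Integer, Double>> columns) {
    Map<String, Double> dots = new HashMap<String, Double>();
    for (Map<Integer, Double> column : columns) {
      List<Integer> rows = new ArrayList<Integer>(column.keySet());
      // every pair of rows present in this column contributes a partial
      // dot product; the partials are summed per pair across all columns
      for (int i = 0; i < rows.size(); i++) {
        for (int j = i + 1; j < rows.size(); j++) {
          String key = pairKey(rows.get(i), rows.get(j));
          double partial = column.get(rows.get(i)) * column.get(rows.get(j));
          Double soFar = dots.get(key);
          dots.put(key, soFar == null ? partial : soFar + partial);
        }
      }
    }
    return dots;
  }
}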

Can you explain the differences from the cited paper?  (Per the comment at the 
top of the Job file.)

For the record, I'm currently running this on ~500K rows and ~150K terms (each 
vector is pretty sparse), and it is taking far longer than the times cited in 
the paper, even though the paper's corpus appears to be bigger, with more terms, 
on weaker hardware.


> (I'm sure you can hack away the cooccurrence
> computation if you know your metric doesn't use it.)
> 
> There are several levers you can pull, including one like Ted mentions --
> maxSimilaritiesPerRow.
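
(A rough sketch of pulling that lever by driving the job through Hadoop's
ToolRunner. The --maxSimilaritiesPerRow name comes straight from this thread;
the other option names, the paths, and the package of RowSimilarityJob are
assumptions that vary between Mahout versions, so treat this as a shape, not
a recipe.)

import org.apache.hadoop.util.ToolRunner;
// assumption: the package of RowSimilarityJob differs between Mahout versions
import org.apache.mahout.math.hadoop.similarity.RowSimilarityJob;

public class RunRowSimilarity {
  public static void main(String[] args) throws Exception {
    // all values below are placeholders for this sketch
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/path/to/row/vectors",
        "--output", "/path/to/similarities",
        "--numberOfColumns", "150000",                 // ~150K terms
        "--similarityClassname", "SIMILARITY_COSINE",  // assumption: value format varies by version
        "--maxSimilaritiesPerRow", "100"               // the lever mentioned above
    });
  }
}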
> 
> On Thu, Jul 14, 2011 at 6:17 PM, Grant Ingersoll <[email protected]>wrote:
>> 
>> Any thoughts on why not reuse our existing Distance measures?  Seems like
>> once you know that two vectors have something in common, there isn't much
>> point in calculating all the co-occurrences; just save off those two (or
>> whatever) and then later call the distance measure on the vectors.
>> 
>> 
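
(To make the contrast concrete: a sketch of the pairwise idea quoted above,
using the DistanceMeasure API from org.apache.mahout.common.distance with toy
made-up vectors. The catch Sean points out is that this only works once both
complete row vectors are already sitting together on one machine, which the
distributed job deliberately avoids.)

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class PairwiseDistanceSketch {
  public static void main(String[] args) {
    // two sparse row vectors over ~150K terms (placeholder values)
    Vector row1 = new RandomAccessSparseVector(150000);
    Vector row2 = new RandomAccessSparseVector(150000);
    row1.set(42, 1.0);
    row1.set(7, 2.0);
    row2.set(42, 3.0);

    DistanceMeasure measure = new CosineDistanceMeasure();
    double d = measure.distance(row1, row2);   // needs both full vectors in memory
    System.out.println("distance = " + d);
  }
}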

--------------------------
Grant Ingersoll


