Re: RowSimilarity ?'s

Grant Ingersoll Thu, 14 Jul 2011 12:44:14 -0700

On Jul 14, 2011, at 2:43 PM, Ted Dunning wrote:

> The typical use with specialized distance functions would be to get the
> cross product of a small-ish number of items against a very large number of
> items.  If we assume that the small set fits in memory then we have Grant's
> recently proposed utility.


See MAHOUT-763.  Almost done w/ the coding.
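
The shape Ted describes can be sketched roughly as follows. This is a plain-Python illustration only, not the MAHOUT-763 code: `cross_distances`, `euclidean`, and the dict-based sparse vectors are names invented for the example. It assumes the small set fits in memory, so the large set can be streamed past it once and any existing distance measure applied to each pair.

```python
import math
from typing import Callable, Dict, Hashable, Iterable, Iterator, Tuple

# Hypothetical sparse vector type for this sketch: index -> value.
Vector = Dict[int, float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cross_distances(
    small: Dict[Hashable, Vector],
    large: Iterable[Tuple[Hashable, Vector]],
    measure: Callable[[Vector, Vector], float],
) -> Iterator[Tuple[Hashable, Hashable, float]]:
    """Yield (small_id, large_id, distance) for every small x large pair.

    `small` is held in memory; `large` is consumed once as a stream,
    so it never needs to fit in memory.
    """
    for large_id, large_vec in large:
        for small_id, small_vec in small.items():
            yield small_id, large_id, measure(small_vec, large_vec)


if __name__ == "__main__":
    small_set = {"s1": {0: 1.0}}
    large_stream = iter([("l1", {0: 1.0}), ("l2", {0: 4.0, 1: 4.0})])
    for triple in cross_distances(small_set, large_stream, euclidean):
        print(triple)
```

Since the measure is just a function argument, any measure that takes two vectors can be plugged in unchanged, which is the appeal of this approach when one side is small.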


> 
> On Thu, Jul 14, 2011 at 11:19 AM, Sean Owen <[email protected]> wrote:
> 
>> I think the answer is that this is a different beast. It is a fully
>> distributed computation, and doesn't have the row
>> Vectors themselves together at the same time. (That would be much more
>> expensive to output -- the cross product of all rows with themselves.) So
>> those other measure implementations can't be applied -- or rather, there's a
>> more efficient way of computing all-pairs similarity here.
>> 
>> You need all cooccurrences since some implementations need that value, and
>> you're computing all-pairs. (I'm sure you can hack away the cooccurrence
>> computation if you know your metric doesn't use it.)
>> 
>> There are several levers you can pull, including one like Ted mentions --
>> maxSimilaritiesPerRow.
>> 
>> On Thu, Jul 14, 2011 at 6:17 PM, Grant Ingersoll <[email protected]> wrote:
>>> 
>>> Any thoughts on why not reuse our existing Distance measures?  Seems like
>>> once you know that two vectors have something in common, there isn't much
>>> point in calculating all the co-occurrences, just save off those two (or
>>> whatever) and then later call the distance measure on the vectors.
>>> 
>>> 
>> 
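
To make Sean's point concrete, here is a single-machine sketch of the cooccurrence idea. It is an illustration under assumptions, not the actual distributed job: `row_similarities` is a made-up name, cosine is picked arbitrarily as the metric, and `max_similarities_per_row` simply mirrors the maxSimilaritiesPerRow lever mentioned above. The matrix is inverted to columns, so only rows that share a column ever get paired, and the two row vectors are never needed side by side.

```python
import heapq
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, List, Tuple

# Hypothetical sparse vector type for this sketch: column index -> value.
Vector = Dict[int, float]


def row_similarities(
    rows: Dict[Hashable, Vector],
    max_similarities_per_row: int = 2,
) -> Dict[Hashable, List[Tuple[float, Hashable]]]:
    """Cosine similarity for all row pairs that cooccur in some column."""
    # Step 1: invert rows to columns, skipping explicit zeros.
    columns = defaultdict(list)
    for row_id, vec in rows.items():
        for col, val in vec.items():
            if val != 0.0:
                columns[col].append((row_id, val))

    # Step 2: each column contributes a partial dot product to every
    # pair of rows that cooccur in it. Sorting keeps pair keys canonical.
    dots = defaultdict(float)
    for entries in columns.values():
        for (r1, v1), (r2, v2) in combinations(sorted(entries), 2):
            dots[(r1, r2)] += v1 * v2

    # Step 3: normalise by row norms and keep only the top-N per row.
    norms = {r: math.sqrt(sum(v * v for v in vec.values()))
             for r, vec in rows.items()}
    per_row = defaultdict(list)
    for (r1, r2), dot in dots.items():
        sim = dot / (norms[r1] * norms[r2])
        per_row[r1].append((sim, r2))
        per_row[r2].append((sim, r1))
    return {r: heapq.nlargest(max_similarities_per_row, sims)
            for r, sims in per_row.items()}


if __name__ == "__main__":
    matrix = {"a": {0: 3.0, 1: 4.0}, "b": {0: 1.0}, "c": {2: 1.0}, "d": {1: 2.0}}
    print(row_similarities(matrix, max_similarities_per_row=1))
```

Rows with nothing in common (like "c" in the usage example) never produce a pair at all, which is where the saving over a full cross product comes from.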

--------------------------
Grant Ingersoll
