On Jul 14, 2011, at 2:43 PM, Ted Dunning wrote:

> The typical use with specialized distance functions would be to get the
> cross product of a small-ish number of items against a very large number of
> items. If we assume that the small set fits in memory then we have Grant's
> recently proposed utility.

See MAHOUT-763. Almost done w/ the coding.

>
> On Thu, Jul 14, 2011 at 11:19 AM, Sean Owen <[email protected]> wrote:
>
>> I think the answer is that this is a different beast. It is a fully
>> distributed computation, and doesn't have the row Vectors themselves
>> together at the same time. (That would be much more expensive to
>> output -- the cross product of all rows with themselves.) So those
>> other measure implementations can't be applied -- or rather, there's a
>> more efficient way of computing all-pairs similarity here.
>>
>> You need all cooccurrences since some implementations need that value, and
>> you're computing all-pairs. (I'm sure you can hack away the cooccurrence
>> computation if you know your metric doesn't use it.)
>>
>> There are several levers you can pull, including one like Ted mentions --
>> maxSimilaritiesPerRow.
>>
>> On Thu, Jul 14, 2011 at 6:17 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>> Any thoughts on why not reuse our existing Distance measures? Seems like
>>> once you know that two vectors have something in common, there isn't much
>>> point in calculating all the co-occurrences, just save of those two (or
>>> whatever) and then later call the distance measure on the vectors.
>>>

--------------------------
Grant Ingersoll
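
For anyone reading the thread later, here is a rough single-machine sketch of the idea Sean describes: all-pairs cosine similarity built from per-column cooccurrences, so two full row vectors are never needed side by side. It is plain Python, not Mahout code. The function name `all_pairs_cosine` and the data layout are made up for illustration, and `max_similarities_per_row` only borrows the name of the lever mentioned above; its exact semantics in Mahout may differ.

```python
import heapq
import math
from collections import defaultdict
from itertools import combinations


def all_pairs_cosine(rows, max_similarities_per_row=None):
    """rows maps row_id -> {column_id: value} (sparse rows).

    Hypothetical helper: returns row_id -> list of (other_row_id, cosine).
    Rows that share no column with any other row do not appear in the result.
    """
    # Pass 1: each row's norm can be computed from that row alone.
    norms = {r: math.sqrt(sum(v * v for v in cols.values()))
             for r, cols in rows.items()}

    # Pass 2: transpose, so each column lists the rows that touch it.
    by_column = defaultdict(list)
    for r, cols in rows.items():
        for c, v in cols.items():
            by_column[c].append((r, v))

    # Pass 3: every column emits one partial product per cooccurring row pair;
    # summing the partials per pair yields the dot product.
    dots = defaultdict(float)
    for entries in by_column.values():
        # sorted() gives each pair a canonical (smaller_id, larger_id) key
        for (r1, v1), (r2, v2) in combinations(sorted(entries), 2):
            dots[(r1, r2)] += v1 * v2

    # Pass 4: combine dot products with the norms from pass 1.
    sims = defaultdict(list)
    for (r1, r2), dot in dots.items():
        denom = norms[r1] * norms[r2]
        if denom == 0.0:
            continue  # skip rows whose stored values are all zero
        s = dot / denom
        sims[r1].append((r2, s))
        sims[r2].append((r1, s))

    # Optional cap: keep only the strongest neighbours of each row.
    if max_similarities_per_row is not None:
        for r in list(sims):
            sims[r] = heapq.nlargest(max_similarities_per_row, sims[r],
                                     key=lambda pair: pair[1])
    return dict(sims)
```

In this sketch the cooccurrence pass is where the cost sits, since a column touched by n rows emits n*(n-1)/2 pairs. That is also why a measure that needs the whole vectors cannot be dropped in at pass 4: by then only the per-pair sums and per-row norms are available.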
