Finally I have internet access here again (I'm on an island at a Cloud Computing Summer School with little to no WiFi...)

I agree with everything that has been said so far. RowSimilarityJob's design is heavily biased towards genericity (supporting arbitrary user-defined similarity measures), so its performance depends on the shape of the input and on intelligent sampling/pruning.

However, you are completely right that there is much room for improvement, e.g. the possibilities to use combiners that were already proposed here. I think the similarity measures would have to be divided into three "classes", and RowSimilarityJob needs a specialized implementation for each of them. I recently planned to invest a little time into this; maybe I'll find time to get at it in the next weeks.

Class 1 would be count-based similarity measures like the Tanimoto coefficient or LLR, whose partial counts can easily be combined by summing.

Class 2 would be measures that only need the cooccurrences between the vectors, like Pearson correlation, Euclidean distance, or cosine if the vectors are normalized. It should be possible to find intelligent (yet a bit hacky) ways to combine their intermediate data.

Class 3 would be measures that are possibly user-supplied and need the "weight" of the input vectors as well as all the cooccurrences.
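To make the class-1 case concrete, here is a minimal sketch (plain Java, not actual Mahout code) of why count-based measures combine so well: the partial cooccurrence counts coming out of different map tasks can simply be summed, and the Tanimoto coefficient is only computed once at the very end from the aggregated counts.

```java
public class TanimotoSketch {

    // Tanimoto coefficient from aggregated counts:
    // |A intersect B| / (|A| + |B| - |A intersect B|)
    static double tanimoto(long cooccurrences, long countA, long countB) {
        return (double) cooccurrences / (countA + countB - cooccurrences);
    }

    public static void main(String[] args) {
        // partial cooccurrence counts, e.g. from two different map tasks
        long partial1 = 3;
        long partial2 = 2;

        // a combiner only needs to sum the partial counts -- summation is
        // commutative and associative, so it can run any number of times
        // on any subset of the partial results
        long cooccurrences = partial1 + partial2;

        System.out.println(tanimoto(cooccurrences, 10, 8)); // 5 / (10 + 8 - 5)
    }
}
```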

I think we definitely need a combiner in RowSimilarityJob to increase its performance (I tried that some time ago but dropped it as it didn't help my particular usecase). At the same time, we should not sacrifice the users' ability to create their own similarity measures; foursquare explicitly spoke highly of the customisability: "On top of this we layered a package called Mahout, which was relatively easy to modify in order to compute custom similarity scores."

RowSimilarityJob is based on papers dealing with document similarity, but it is currently mostly (or only) used for neighbourhood-based collaborative filtering, where the input matrices are usually much sparser. I'd propose that in addition to adding combiners, we create an additional wrapping job for comparing documents around RowSimilarityJob, in the same way that ItemSimilarityJob wraps it for CF. This DocumentSimilarityJob could also provide text-specific sampling/pruning strategies, e.g. removing terms with a high document frequency. Actually, RowSimilarityJob should not be used on its own by people who are not aware of its "pitfalls".
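As an illustration of the kind of text-specific pruning such a wrapper could offer (a hand-rolled sketch, not an existing Mahout API; the class and method names are made up), dropping terms above a document-frequency threshold before the similarity computation is straightforward:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class HighDfPruningSketch {

    // Remove all terms whose document frequency exceeds maxDfFraction of
    // the corpus size -- very frequent terms blow up the number of
    // cooccurring pairs while adding little discriminative power.
    static List<List<String>> pruneHighDfTerms(List<List<String>> docs,
                                               double maxDfFraction) {
        // count in how many documents each term occurs
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        long maxDf = (long) (maxDfFraction * docs.size());
        // keep only terms at or below the DF threshold
        List<List<String>> pruned = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String term : doc) {
                if (df.get(term) <= maxDf) {
                    kept.add(term);
                }
            }
            pruned.add(kept);
        }
        return pruned;
    }
}
```

With a threshold of 0.5, a term occurring in all three of three documents would be dropped while the rarer terms survive.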

I also remember that we once had someone on the list who used RowSimilarityJob for precomputing the similarities between millions of documents. Unfortunately I haven't been able to find the conversation yet. IIRC he successfully applied a very aggressive sampling strategy.

--sebastian


On 19.07.2011 02:19, Grant Ingersoll wrote:

On Jul 18, 2011, at 6:09 PM, Sean Owen wrote:

Right! But how do you do that if you only saved co-occurrence counts?

You can surely pull a very similarly-shaped trick to calculate the
cosine measure; that's exactly what this paper is doing in fact. But
it's a different computation.

Right now the job saves *all* the info it might need to calculate any
of these things later. And that's heavy.

Yes.  That is the thing I am questioning.  Do we need to do that?  I'm arguing 
that doing so makes for an algorithm that doesn't scale, even if it is correct.


On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]> wrote:
On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:

How do you implement, for instance, the cosine similarity with this output?
That's the intent behind preserving this info, which is surely a lot
to preserve.


Sorry to jump in the middle of this, but cosine is not too hard to compute with nice
combiners: it can be done by first normalizing the rows and then
doing my ubiquitous "outer product of columns" trick on the resulting
corpus (this latter job uses combiners easily because the mappers do all
the multiplications and the reducers are simply sums, which are commutative
and associative).

Not sure about the other fancy similarities.
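Jake's trick above can be sketched in a few lines of plain Java (my single-machine illustration, not Mahout's actual implementation): after L2-normalizing the rows, the cosine between rows i and j is the sum over columns c of a[i][c] * a[j][c]. Each column contributes an independent partial product ("the mappers do all multiplications"), and summing those partial products is commutative and associative, so a combiner can fold them in any order.

```java
public class OuterProductSketch {

    // Cosine similarities between all rows of a, computed column by column.
    // Conceptually: one "map" task per column emits all pairwise products
    // for that column; the "reduce"/combiner step just sums them.
    public static double[][] rowCosines(double[][] a) {
        int rows = a.length;
        int cols = a[0].length;
        // L2-normalize each row so the dot product equals the cosine
        for (int i = 0; i < rows; i++) {
            double norm = 0;
            for (int c = 0; c < cols; c++) norm += a[i][c] * a[i][c];
            norm = Math.sqrt(norm);
            for (int c = 0; c < cols; c++) a[i][c] /= norm;
        }
        double[][] sim = new double[rows][rows];
        // "outer product of columns": each column c contributes the
        // partial products a[i][c] * a[j][c], summed by the combiner
        for (int c = 0; c < cols; c++) {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < rows; j++) {
                    sim[i][j] += a[i][c] * a[j][c];
                }
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        double[][] sim = rowCosines(new double[][] {{1, 0}, {1, 1}});
        System.out.println(sim[0][1]); // cosine of (1,0) and (1,1), ~0.7071
    }
}
```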

--------------------------
Grant Ingersoll



