Finally I have internet access here again (I'm on an island at a Cloud Computing Summer School with little to no WiFi...)

I agree with everything that has been said so far. RowSimilarityJob's design is heavily biased towards genericity (supporting arbitrary user-defined similarity measures), so its performance depends on the shape of the input and on intelligent sampling/pruning.

However, you are completely right that there is much room for improvement, e.g. the possibilities to use combiners that were already proposed here. I think the similarity measures would have to be divided into three "classes", and RowSimilarityJob needs a specialized implementation for each of them. I recently planned to invest a little time into this; maybe I'll find time to get at it in the next weeks.

Class 1 would be count-based similarity measures like the Tanimoto coefficient or LLR, whose partial counts can easily be combined by summing.

Class 2 would be measures that only need the cooccurrences between the vectors, like Pearson correlation, Euclidean distance, or cosine if the vectors are normalized. It should be possible to find intelligent (yet a bit hacky) ways to combine their intermediate data.

Class 3 would be measures that are possibly user-supplied and need the "weight" of the input vectors as well as all the cooccurrences.
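To make the class-1 case concrete, here is a minimal sketch (plain Java, not actual Mahout code) of why count-based measures combine so well: the partial cooccurrence counts coming out of different map tasks can simply be summed, and the Tanimoto coefficient is only computed once at the very end from the aggregated counts.

```java
public class TanimotoSketch {

    // Tanimoto coefficient from aggregated counts:
    // |A intersect B| / (|A| + |B| - |A intersect B|)
    static double tanimoto(long cooccurrences, long countA, long countB) {
        return (double) cooccurrences / (countA + countB - cooccurrences);
    }

    public static void main(String[] args) {
        // partial cooccurrence counts, e.g. from two different map tasks
        long partial1 = 3;
        long partial2 = 2;

        // a combiner only needs to sum the partial counts -- summation is
        // commutative and associative, so it can run any number of times
        // on any subset of the partial results
        long cooccurrences = partial1 + partial2;

        System.out.println(tanimoto(cooccurrences, 10, 8)); // 5 / (10 + 8 - 5)
    }
}
```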

I think we definitely need a combiner in RowSimilarityJob to increase its performance (I tried that some time ago but dropped it as it didn't help my particular usecase). At the same time, we should not sacrifice the users' ability to create their own similarity measures; foursquare explicitly spoke highly of the customisability: "On top of this we layered a package called Mahout, which was relatively easy to modify in order to compute custom similarity scores."

RowSimilarityJob is based on papers dealing with document similarity, but it is currently mostly (or only) used for neighbourhood-based collaborative filtering, where the input matrices are usually much sparser. I'd propose that in addition to adding combiners, we create an additional wrapping job for comparing documents around RowSimilarityJob, in the same way that ItemSimilarityJob wraps it for CF. This DocumentSimilarityJob could also provide text-specific sampling/pruning strategies, e.g. removing terms with a high document frequency. Actually, RowSimilarityJob should not be used on its own by people who are not aware of its "pitfalls".
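As an illustration of the kind of text-specific pruning such a wrapper could offer (a hand-rolled sketch, not an existing Mahout API; the class and method names are made up), dropping terms above a document-frequency threshold before the similarity computation is straightforward:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class HighDfPruningSketch {

    // Remove all terms whose document frequency exceeds maxDfFraction of
    // the corpus size -- very frequent terms blow up the number of
    // cooccurring pairs while adding little discriminative power.
    static List<List<String>> pruneHighDfTerms(List<List<String>> docs,
                                               double maxDfFraction) {
        // count in how many documents each term occurs
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        long maxDf = (long) (maxDfFraction * docs.size());
        // keep only terms at or below the DF threshold
        List<List<String>> pruned = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String term : doc) {
                if (df.get(term) <= maxDf) {
                    kept.add(term);
                }
            }
            pruned.add(kept);
        }
        return pruned;
    }
}
```

With a threshold of 0.5, a term occurring in all three of three documents would be dropped while the rarer terms survive.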

I also remember that we once had someone on the list who used RowSimilarityJob for precomputing the similarities between millions of documents. Unfortunately I haven't been able to find the conversation yet. IIRC he successfully applied a very aggressive sampling strategy.

--sebastian


On 19.07.2011 02:19, Grant Ingersoll wrote:

On Jul 18, 2011, at 6:09 PM, Sean Owen wrote:

Right! But how do you do that if you only saved co-occurrence counts?

You can surely pull a very similarly-shaped trick to calculate the
cosine measure; that's exactly what this paper is doing in fact. But
it's a different computation.

Right now the job saves *all* the info it might need to calculate any
of these things later. And that's heavy.

Yes.  That is the thing I am questioning.  Do we need to do that?  I'm arguing 
that doing so makes for an algorithm that doesn't scale, even if it is correct.


On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]> wrote:
On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:

How do you implement, for instance, the cosine similarity with this output?
That's the intent behind preserving this info, which is surely a lot
to preserve.


Sorry to jump in the middle of this, but cosine is not too hard to compute with nice
combiners: it can be done by first normalizing the rows and then
doing my ubiquitous "outer product of columns" trick on the resulting
corpus (this latter job uses combiners easily because the mappers do all
the multiplications and the reducers are simply sums, which are commutative
and associative).

Not sure about the other fancy similarities.
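Jake's trick above can be sketched in a few lines of plain Java (my single-machine illustration, not Mahout's actual implementation): after L2-normalizing the rows, the cosine between rows i and j is the sum over columns c of a[i][c] * a[j][c]. Each column contributes an independent partial product ("the mappers do all multiplications"), and summing those partial products is commutative and associative, so a combiner can fold them in any order.

```java
public class OuterProductSketch {

    // Cosine similarities between all rows of a, computed column by column.
    // Conceptually: one "map" task per column emits all pairwise products
    // for that column; the "reduce"/combiner step just sums them.
    public static double[][] rowCosines(double[][] a) {
        int rows = a.length;
        int cols = a[0].length;
        // L2-normalize each row so the dot product equals the cosine
        for (int i = 0; i < rows; i++) {
            double norm = 0;
            for (int c = 0; c < cols; c++) norm += a[i][c] * a[i][c];
            norm = Math.sqrt(norm);
            for (int c = 0; c < cols; c++) a[i][c] /= norm;
        }
        double[][] sim = new double[rows][rows];
        // "outer product of columns": each column c contributes the
        // partial products a[i][c] * a[j][c], summed by the combiner
        for (int c = 0; c < cols; c++) {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < rows; j++) {
                    sim[i][j] += a[i][c] * a[j][c];
                }
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        double[][] sim = rowCosines(new double[][] {{1, 0}, {1, 1}});
        System.out.println(sim[0][1]); // cosine of (1,0) and (1,1), ~0.7071
    }
}
```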

--------------------------
Grant Ingersoll



