Finally I have internet access here again (I'm on an island at a Cloud
Computing summer school with little to no WiFi...)
I agree with everything that has been said so far. RowSimilarityJob's
design is heavily biased towards genericity (supporting arbitrary
user-defined similarity measures), and therefore its performance depends
on the shape of the input and on intelligent sampling/pruning.
However, you are completely right that there is much room for
improvement, as there are some possibilities to use combiners, as
already proposed here. I think the similarity measures would have to be
divided into three "classes", and RowSimilarityJob needs a specialized
implementation for each of them. I've been planning to invest a little
time into that; maybe I'll find time to get to it in the next few weeks.
Class 1 would be count-based similarity measures like the Tanimoto
coefficient or LLR, which can easily be combined by summing the partial
counts.
Class 2 would be measures that only need the cooccurrences between the
vectors, like Pearson correlation, Euclidean distance, or cosine if the
vectors are normalized; it should be possible to find intelligent (yet a
bit hacky) ways to combine their intermediate data.
Class 3 would be measures that are possibly user-supplied and need the
"weight" of the input vectors as well as all the cooccurrences.
I think we definitely need a combiner in RowSimilarityJob (I tried that
some time ago but dropped it as it didn't help my particular use case)
to increase its performance, but we should not sacrifice the possibility
for users to create their own similarity measures. Foursquare explicitly
spoke highly of that customisability: "On top of this we layered a
package called Mahout, which was relatively easy to modify in order to
compute custom similarity scores."
RowSimilarityJob is based on papers dealing with document similarity but
is currently mostly (or only) used for neighbourhood-based collaborative
filtering, where the input matrices are usually much sparser. I'd
propose that in addition to adding combiners we create an additional
wrapping job for comparing documents around RowSimilarityJob, in the
same way that ItemSimilarityJob wraps it for CF. This
DocumentSimilarityJob could also provide text-specific sampling/pruning
strategies, e.g. removing terms with a high DF. Actually,
RowSimilarityJob should not be used on its own by people who aren't
aware of its "pitfalls".
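The DF-based pruning I have in mind could look roughly like this (a standalone sketch under my own naming, not part of Mahout): compute each term's document frequency and drop any term that appears in more than a given fraction of the documents before similarities are computed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class HighDfPruning {

    // Remove terms appearing in more than maxDfFraction of the documents.
    // Such terms (stopword-like) blow up the number of cooccurrences
    // without adding much signal for document similarity.
    static List<Set<String>> prune(List<Set<String>> docs,
                                   double maxDfFraction) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int maxDf = (int) Math.floor(maxDfFraction * docs.size());
        return docs.stream()
                .map(doc -> doc.stream()
                        .filter(t -> df.get(t) <= maxDf)
                        .collect(Collectors.toSet()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("the", "cat", "sat"),
                Set.of("the", "dog", "ran"),
                Set.of("the", "cat", "slept"));
        // "the" appears in all 3 documents (DF fraction 1.0) and is pruned
        System.out.println(prune(docs, 0.8));
    }
}
```

In a real job this would of course be a map-side filter over the term-frequency vectors rather than an in-memory pass.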
I also remember that we once had someone on the list who used
RowSimilarityJob for precomputing the similarities between millions of
documents. Unfortunately I haven't been able to find that conversation
yet. IIRC he successfully applied a very aggressive sampling strategy.
--sebastian
On 19.07.2011 02:19, Grant Ingersoll wrote:
On Jul 18, 2011, at 6:09 PM, Sean Owen wrote:
Right! but how do you do that if you only saved co-occurrence counts?
You can surely pull a very similarly-shaped trick to calculate the
cosine measure; that's exactly what this paper is doing in fact. But
it's a different computation.
Right now the job saves *all* the info it might need to calculate any
of these things later. And that's heavy.
Yes. That is the thing I am questioning. Do we need to do that? I'm arguing
that doing so makes for an algorithm that doesn't scale, even if it is correct.
On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]> wrote:
On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:
How do you implement, for instance, the cosine similarity with this output?
That's the intent behind preserving this info, which is surely a lot
to preserve.
Sorry to jump into the middle of this, but cosine is not too hard to do
with nice combiners: it can be computed by first normalizing the rows
and then doing my ubiquitous "outer product of columns" trick on the
resultant corpus (this latter job uses combiners easily because the
mappers do all the multiplications and the reducers are simply sums,
which are commutative and associative).
Not sure about the other fancy similarities.
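For readers following along, the "normalize rows, then outer product of columns" trick can be sketched in a few lines of plain Java (an in-memory illustration only, not the distributed Mahout job): each column contributes partial products that are merely summed, so the reduce step is combiner-friendly.

```java
public class CosineOuterProduct {

    // Scale each row to unit Euclidean length.
    static double[][] normalizeRows(double[][] m) {
        double[][] out = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            double norm = 0;
            for (double v : m[i]) {
                norm += v * v;
            }
            norm = Math.sqrt(norm);
            out[i] = new double[m[i].length];
            for (int j = 0; j < m[i].length; j++) {
                out[i][j] = m[i][j] / norm;
            }
        }
        return out;
    }

    // "Outer product of columns": for each column j, emit the partial
    // products row_a[j] * row_b[j]. The only reduce operation is addition,
    // which is commutative and associative -- exactly what combiners need.
    static double[][] similarities(double[][] normalized) {
        int n = normalized.length, d = normalized[0].length;
        double[][] sim = new double[n][n];
        for (int j = 0; j < d; j++) {          // one "map" pass per column
            for (int a = 0; a < n; a++) {
                for (int b = 0; b < n; b++) {
                    sim[a][b] += normalized[a][j] * normalized[b][j];
                }
            }
        }
        return sim;                            // sim[a][b] = cosine(row a, row b)
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 0}, {1, 1}};
        double[][] sim = similarities(normalizeRows(rows));
        System.out.println(sim[0][1]); // cos(45 degrees) ~ 0.7071
    }
}
```

Because the normalization happens up front, no per-row norms have to be carried through the shuffle; partial sums can be folded together in any order.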
--------------------------
Grant Ingersoll