Hi Sebastian,

Did you see Ted's latest on this thread?
On Jul 19, 2011, at 3:24 AM, Sebastian Schelter wrote:

> Finally having internet access here again (I'm on an island at a Cloud
> Computing Summerschool with little to no WiFi...)
>
> I agree with everything that has been said so far. RowSimilarityJob's
> design is heavily biased towards genericity (supporting arbitrary
> user-defined similarity measures) and therefore the performance is
> dependent on the shape of the input and intelligent sampling/pruning.
>
> However, you are completely right that there is much room for improvement,
> as there are some possibilities to use combiners as already proposed here.
> I think the similarity measures would have to be divided into three
> "classes" and RowSimilarityJob needs a specialized implementation for each
> of these. I had already planned to invest a little time into that
> recently; maybe I'll find time to go at it in the next weeks.
>
> Class 1 would be count-based similarity measures like the
> Tanimoto coefficient or LLR that can easily be combined by summing the
> partial counts.
>
> Class 2 would be measures that only need the cooccurrences between the
> vectors, like Pearson correlation, Euclidean distance, or Cosine if the
> vectors are normalized; it should be possible to find intelligent (yet a
> bit hacky) ways to combine their intermediate data.

I think these can be in class 1 (or at least Euclidean and Cosine, I'm not
that familiar with Pearson). There's a rough sketch of the counting path in
the P.S. at the bottom of this mail.

> Class 3 would be measures that are possibly user-supplied and need the
> "weight" of the input vectors as well as all the cooccurrences.

To me, we should enforce that user-supplied measures fit into class 1.
Basically, I believe we shouldn't have to emit a Cooccurrence object ever.
Or we split RowSimilarityJob into CooccurrenceRowSimilarityJob (the current
approach, for those who want totally generic pluggability) and
CountingRowSimilarityJob (for those who want something that will scale), or
something like that which better captures what each one does.

> I think we definitely need to have a combiner included in
> RowSimilarityJob (I tried that some time ago but dropped it as it didn't
> help my particular usecase) to increase its performance, but we should
> not sacrifice giving users the possibility to create their own similarity
> measures. Foursquare explicitly spoke highly of the possibility of
> customisation: "On top of this we layered a package called Mahout, which
> was relatively easy to modify in order to compute custom similarity
> scores."
>
> RowSimilarityJob is based on papers dealing with document similarity but
> is currently mostly/only used for neighbourhood-based collaborative
> filtering, where the input matrices are usually much more sparse.

I'm using it for doc similarity.

> I'd propose that in addition to adding combiners we should create an
> additional wrapping job for comparing documents around RowSimilarityJob,
> in the same way as ItemSimilarityJob wraps it for CF. The
> DocumentSimilarityJob could also provide text-specific sampling/pruning
> strategies, like removing terms with high DF for example. Actually,
> RowSimilarityJob should not be used on its own by people not aware of its
> "pitfalls".

Maybe. We already have DF pruning strategies in Seq2Sparse; I don't think
we need to couple the two.

> I also remember that we once had someone on the list who used
> RowSimilarityJob for precomputing the similarities between millions of
> documents. Unfortunately I couldn't find the conversation yet. IIRC he
> successfully applied a very aggressive sampling strategy.
Sure, that makes sense, but I also think we should at least be able to
match what is in the paper, and by my take that should be well within
reach. (I've sketched my reading of Jake's outer-product-of-columns trick
in the second P.S. below.)

> --sebastian
>
>
> On 19.07.2011 02:19, Grant Ingersoll wrote:
>>
>> On Jul 18, 2011, at 6:09 PM, Sean Owen wrote:
>>
>>> Right! but how do you do that if you only saved co-occurrence counts?
>>>
>>> You can surely pull a very similarly-shaped trick to calculate the
>>> cosine measure; that's exactly what this paper is doing in fact. But
>>> it's a different computation.
>>>
>>> Right now the job saves *all* the info it might need to calculate any
>>> of these things later. And that's heavy.
>>
>> Yes. That is the thing I am questioning. Do we need to do that? I'm
>> arguing that doing so makes for an algorithm that doesn't scale, even
>> if it is correct.
>>
>>>
>>> On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]> wrote:
>>>> On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:
>>>>
>>>>> How do you implement, for instance, the cosine similarity with this
>>>>> output? That's the intent behind preserving this info, which is
>>>>> surely a lot to preserve.
>>>>>
>>>>
>>>> Sorry to jump in the middle of this, but cosine is not too hard to do
>>>> with nice combiners, as it can be done by first normalizing the rows
>>>> and then doing my ubiquitous "outer product of columns" trick on the
>>>> resultant corpus (this latter job uses combiners easily because the
>>>> mappers do all the multiplications and the reducers are simply sums,
>>>> which are commutative and associative).
>>>>
>>>> Not sure about the other fancy similarities.
>>
>> --------------------------
>> Grant Ingersoll
>>
>>
>>
>

--------------------------
Grant Ingersoll
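
P.S. Here's roughly what I have in mind for the counting path (class 1).
This is only a sketch against plain Hadoop with made-up class names, not
what RowSimilarityJob does today, and it assumes binary occurrence data
with one comma-separated line of item ids per row:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public final class CountingRowSimilaritySketch {

  // For every input row (e.g. one user's items), emit a 1 for each pair of
  // items that cooccurs in it.
  public static class CooccurrenceCountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    @Override
    protected void map(LongWritable offset, Text row, Context ctx)
        throws IOException, InterruptedException {
      String[] items = row.toString().split(",");
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          ctx.write(new Text(items[i] + ',' + items[j]), ONE);
        }
      }
    }
  }

  // Summing partial counts is commutative and associative, so this one class
  // can be registered as both combiner and reducer (job.setCombinerClass()
  // and job.setReducerClass()). Tanimoto or LLR is then computed from the
  // final counts, e.g. cooccurrences / (countA + countB - cooccurrences),
  // in a cheap follow-up step.
  public static class CountSummingReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text itemPair, Iterable<LongWritable> partialCounts,
        Context ctx) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable count : partialCounts) {
        sum += count.get();
      }
      ctx.write(itemPair, new LongWritable(sum));
    }
  }

  private CountingRowSimilaritySketch() {}
}

The point is just that nothing other than counts ever has to leave the
mappers, which is why I don't think we need to emit a Cooccurrence object.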

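P.P.S. And here's my reading of Jake's "normalize the rows, then take the
outer product of the columns" trick for cosine; same disclaimers (plain
Hadoop sketch, made-up names, the input format is my assumption). One input
record per column of the transposed matrix, holding rowId:weight entries
taken from rows that were L2-normalized up front. Every emitted value is a
partial product and the reduce side only sums, so a DoubleWritable version
of the summing combiner/reducer above drops straight in, and the summed dot
products of unit-length rows are exactly the pairwise cosines:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One input record per column, e.g. "rowA:weightA rowB:weightB ...", with
// weights coming from rows that have already been L2-normalized.
public class ColumnOuterProductMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {
  @Override
  protected void map(LongWritable offset, Text column, Context ctx)
      throws IOException, InterruptedException {
    String[] entries = column.toString().trim().split("\\s+");
    for (int i = 0; i < entries.length; i++) {
      String[] a = entries[i].split(":");
      double weightA = Double.parseDouble(a[1]);
      for (int j = i + 1; j < entries.length; j++) {
        String[] b = entries[j].split(":");
        // This column's contribution to dot(rowA, rowB); summed over all
        // columns (multiplications in the mappers, additions in the
        // combiners/reducers) it is the cosine of the two normalized rows.
        ctx.write(new Text(a[0] + ',' + b[0]),
            new DoubleWritable(weightA * Double.parseDouble(b[1])));
      }
    }
  }
}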