On Jul 18, 2011, at 3:31 PM, Sean Owen wrote:

> If the question is, why do we output *cooccurrence* instead of
> something else, like the elements of a vector product in the paper,
> then the answer is that it's to support different similarity metrics.
> In both cases, the algorithms are building up item-item similarity by
> extracting a billion tiny bits of item-item similarity -- whether it's
> cooccurrence or a dot product -- and piecing it together later.
> 
> (I do think the generality comes at a price: it outputs not just
> a count but a full record of every co-occurrence, which is big. The
> upside is that you can implement most any common metric downstream with
> that much info preserved.)
> 
> 
> What did you have in mind by map-side? You can't do this all map-side,
> and the paper suggests it emits all the "bits" from the mapper and
> then adds them up in the reducer, which is the same as in the
> RowSimilarityJob procedure. Unless I'm missing something significant, it's
> the same.

We definitely cannot do this all map-side, but I think the big difference is 
that the paper emits the partial sum of all the co-occurrences it has seen, 
whereas we emit every individual little co-occurrence.  That is the difference 
between one write per outer-loop iteration (what amounts to a few identifying 
pieces and a double) and one write of a Cooccurrence per inner-loop iteration, 
as I understand it.  I think that makes for a huge difference in performance.
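
For concreteness, here is a rough sketch of the two emission strategies being 
contrasted: one tiny write per inner-loop iteration versus one larger write per 
outer-loop iteration with the partial sums folded in map-side.  The class and 
method names below are made up for illustration; this is neither the actual 
RowSimilarityJob code nor the paper's implementation.

import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class EmissionSketch {

  // One write per inner-loop iteration: a row with k non-zero entries produces
  // on the order of k^2 tiny (item pair, weight) records.
  static void perInnerLoopWrite(int[] items, double[] weights,
                                BiConsumer<String, Double> emit) {
    for (int i = 0; i < items.length; i++) {
      for (int j = 0; j < items.length; j++) {
        if (i == j) continue;
        emit.accept(items[i] + ":" + items[j], weights[i] * weights[j]);
      }
    }
  }

  // One write per outer-loop iteration: the inner loop is folded into an
  // in-memory partial sum ("stripe") that is written once per outer item.
  static void perOuterLoopWrite(int[] items, double[] weights,
                                BiConsumer<Integer, Map<Integer, Double>> emit) {
    for (int i = 0; i < items.length; i++) {
      Map<Integer, Double> stripe = new HashMap<>();
      for (int j = 0; j < items.length; j++) {
        if (i == j) continue;
        stripe.merge(items[j], weights[i] * weights[j], Double::sum);
      }
      emit.accept(items[i], stripe);   // one larger record per outer item
    }
  }
}

The weight information carried is roughly the same either way; the difference is 
how many records get serialized and shuffled, and how much a combiner can 
collapse the stripes across rows before the reduce.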


> 
> 
> If the question is, why do we output *all* cooccurrences, then it's
> because in theory you need to count everything to compute all
> item-item similarities. None of those are somehow irrelevant or
> redundant. However, each individual cooccurrence may make very little
> difference in the final result. Pruning is just fine here, but
> RowSimilarityJob does *not* have levers to pull to throw away data at
> this stage in order to emit fewer cooccurrences. It could.
> 
> 
> That's what the paper does pretty coarsely by just throwing away large
> posting lists; it would not be quite so valid for recommenders.
> Ignoring the word "the" in doc similarity is not the same as ignoring
> a very common album. But, there are simpler rules like what Ted's said
> a few times -- just cap the size of any column (yes, here we mean
> column, not row) by randomly tossing elements past a certain size.
> 

Feature reduction is fine and all, but that isn't the problem here and it has 
nothing to do with what I am talking about.  I'm just saying that if 
RowSimilarityJob claims to implement what is in the paper, then we aren't doing 
it right.  As I said before, I'm running this on a smaller matrix than the one 
in the paper, on much better hardware, and the performance is roughly 100x (my 
guess) worse.  The paper claims a two-hour running time; I've yet to see this 
finish in 24-48 hours.

> 
> The combiner question is interesting. Many metrics happen to be built
> up from some commutative function of smaller parts, like addition. I
> think the answer is 'yes' in many cases -- but not in all cases. For
> example, I don't think you can construe log-likelihood this way. I'm
> also not sure how Pearson would come out, since you need the mean from
> the start, though I wouldn't be surprised if there's a clever way to
> compute it online.
> 
> 
> I think the big big win for your use case is adding that lever to trim
> down big columns, which is quite easy to add.
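
The lever Sean describes might look roughly like this: cap a column by random 
down-sampling (reservoir sampling) before any co-occurrence pairs are emitted.  
The names below are hypothetical, not existing Mahout code; it is only an 
illustration of the idea.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ColumnCapSketch {

  // Keep at most maxEntries of a column's entries, chosen uniformly at random
  // (reservoir sampling), so a very popular item contributes at most
  // maxEntries^2 co-occurrence pairs instead of n^2.
  static <E> List<E> capColumn(Iterable<E> columnEntries, int maxEntries, Random rng) {
    List<E> reservoir = new ArrayList<>(maxEntries);
    int seen = 0;
    for (E entry : columnEntries) {
      seen++;
      if (reservoir.size() < maxEntries) {
        reservoir.add(entry);
      } else {
        int k = rng.nextInt(seen);   // kept with probability maxEntries / seen
        if (k < maxEntries) {
          reservoir.set(k, entry);
        }
      }
    }
    return reservoir;
  }
}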

I think the big win is to not present this implementation as being based on what 
is in the paper.  I'm starting to think we should have two RowSimilarity jobs: 
one for similarity functions that are algebraic and one for those that are not.
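
As a minimal sketch of what "algebraic" buys you, assuming Pearson is rewritten 
in terms of running sums (one of the online formulations Sean alludes to), the 
partial statistics below merge associatively.  The class is hypothetical, not 
Mahout code.

public class PearsonPartial {

  private long n;
  private double sumX, sumY, sumXX, sumYY, sumXY;

  // Accumulate one co-rated pair of preference values.
  void add(double x, double y) {
    n++;
    sumX += x;  sumY += y;
    sumXX += x * x;  sumYY += y * y;  sumXY += x * y;
  }

  // Merging two partials is associative and commutative -- combiner-safe.
  void merge(PearsonPartial other) {
    n += other.n;
    sumX += other.sumX;  sumY += other.sumY;
    sumXX += other.sumXX;  sumYY += other.sumYY;  sumXY += other.sumXY;
  }

  // Finish once, reducer-side:
  // r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
  double correlation() {
    double num = n * sumXY - sumX * sumY;
    double den = Math.sqrt((n * sumXX - sumX * sumX) * (n * sumYY - sumY * sumY));
    return den == 0.0 ? 0.0 : num / den;
  }
}

Because merge() is associative and commutative, a Combiner can collapse these 
partials arbitrarily before the reducer calls correlation(); a measure that 
cannot be decomposed into mergeable partial statistics gets no such help.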


> 
> 
> On Mon, Jul 18, 2011 at 7:42 PM, Grant Ingersoll <[email protected]> wrote:
>> Been looking at this some more...
>> 
>> I don't think the paper (or the other one I cited later, which has more 
>> details) outputs all this temporary co-occurrence stuff that we do.  We seem 
>> to be doing context.write() inside an inner loop and simply outputting 
>> that a relationship exists.  I think the paper calculates as much of the 
>> score as it can on the map side, emits the partial score, and then the 
>> reducer does the final sum.  If I'm reading our stuff right, we 
>> output all the co-occurrences and then do the full sum in the reducer.  
>> Perhaps we emit all the co-occurrences b/c that allows us to plug in 
>> different similarity measures?  See page 3 of the Jimmy Lin paper I 
>> emailed the other day.
>> 
>> Also, could we benefit from a Combiner in the similarity calculation phase?  
>> That's all algebraic there, right?
>> 
>> 
>> 
>> On Jul 14, 2011, at 6:26 PM, Sean Owen wrote:
>> 
>>> What's a row here, a user? I completely agree, but then this describes how
>>> you start item-item similarity computation, where items are columns, right?
>>> The job here is turned on its side, computing row similarity.
>>> On Jul 14, 2011 11:21 PM, "Ted Dunning" <[email protected]> wrote:
>>>> The problem arises when the program is reading a single row and emitting all
>>>> of the cooccurring items. The number of pairs emitted is the square of the
>>>> number of items in a row. Thus, it is the denser rows that cause the
>>>> problem.
>>>> 
>>>> On Thu, Jul 14, 2011 at 2:25 PM, Sean Owen <[email protected]> wrote:
>>>> 
>>>>> In their example, docs were rows and words were columns. The terms of
>>>>> the inner products they computed came from processing the posting
>>>>> lists / columns instead of rows and emitting all pairs of docs
>>>>> containing a word. Sounds like they just tossed the posting list for
>>>>> common words. Anyway that's why I said cols and think that's right. At
>>>>> least, that is what RowSimilarityJob is doing.
>>>>> 
>>>>> On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]>
>>>>> wrote:
>>>>>> Rows.
>>>>>> 
>>>>>> On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
>>>>>> 
>>>>>>> Just needs a rule for
>>>>>>> tossing data -- you could simply throw away such columns (ouch), or at
>>>>>>> least
>>>>>>> use only a sampled subset of them.
>>>>>>> 
>>>>>> 
>>>>> 
>> 

--------------------------
Grant Ingersoll


