Been looking at this some more... 

I don't think the paper (or the other one I cited later, which has more
details) outputs all of this intermediate co-occurrence data the way we do.
We seem to be doing context.write() inside an inner loop and simply
outputting that a relationship exists.  I think the paper calculates as much
of the score as it can on the map side, emits the partial score, and lets the
reducer do the final sum.  If I'm reading our code right, we output all the
co-occurrences and then do the full sum in the reducer.  Perhaps we emit all
the co-occurrences because that lets us plug in different similarity measures?
See page 3 of the Jimmy Lin paper I emailed the other day.
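
To make sure I'm describing the same thing, here's roughly what I understand
the paper's map-side approach to be.  This is only a sketch based on my
reading of page 3, not our actual code -- PostingList, Posting, and PairOfInts
are placeholder types I made up:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper input: one posting list, i.e. a term plus the (docId, weight) pairs
// for the docs containing it.  For every pair of docs sharing the term we
// emit the partial dot-product contribution, so the reducer only has to sum.
public class PartialScoreMapper
    extends Mapper<Text, PostingList, PairOfInts, DoubleWritable> {

  @Override
  protected void map(Text term, PostingList postings, Context ctx)
      throws IOException, InterruptedException {
    List<Posting> docs = postings.getPostings();
    for (int i = 0; i < docs.size(); i++) {
      for (int j = i + 1; j < docs.size(); j++) {
        double partial = docs.get(i).getWeight() * docs.get(j).getWeight();
        // emit a partial score, not just the fact that these two co-occur
        ctx.write(new PairOfInts(docs.get(i).getDocId(), docs.get(j).getDocId()),
                  new DoubleWritable(partial));
      }
    }
  }
}

// Reducer just sums the partial scores for each doc pair.
public class PartialScoreSumReducer
    extends Reducer<PairOfInts, DoubleWritable, PairOfInts, DoubleWritable> {

  @Override
  protected void reduce(PairOfInts pair, Iterable<DoubleWritable> partials,
      Context ctx) throws IOException, InterruptedException {
    double sum = 0.0;
    for (DoubleWritable p : partials) {
      sum += p.get();
    }
    ctx.write(pair, new DoubleWritable(sum));
  }
}

Contrast that with us emitting one record per co-occurrence and computing the
whole score in the reducer.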

Also, could we benefit from a Combiner in the similarity calculation phase?  
That's all algebraic there, right?
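
If so, a sum combiner seems cheap to try: sums are associative and
commutative, and the combiner's input and output types match, so the reducer
sketched above could even double as the combiner.  Again, just a sketch (the
driver wiring and job name are made up):

Job job = new Job(conf, "similarity-partial-sums");
job.setMapperClass(PartialScoreMapper.class);
job.setCombinerClass(PartialScoreSumReducer.class);  // local pre-aggregation on the map side
job.setReducerClass(PartialScoreSumReducer.class);
job.setMapOutputKeyClass(PairOfInts.class);
job.setMapOutputValueClass(DoubleWritable.class);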



On Jul 14, 2011, at 6:26 PM, Sean Owen wrote:

> What's a row here, a user? I completely agree, but then this describes how
> you start item-item similarity computation, where items are columns, right?
> The job here is turned on its side, computing row similarity.
> On Jul 14, 2011 11:21 PM, "Ted Dunning" <[email protected]> wrote:
>> The problem arises when the program is reading a single row and emitting
>> all of the co-occurring items. The number of items emitted is the square of
>> the number of items in a row. Thus, it is the denser rows that cause the
>> problem.
>> 
>> On Thu, Jul 14, 2011 at 2:25 PM, Sean Owen <[email protected]> wrote:
>> 
>>> In their example, docs were rows and words were columns. The terms of
>>> the inner products they computed came from processing the posting
>>> lists / columns instead of rows and emitting all pairs of docs
>>> containing a word. Sounds like they just tossed the posting list for
>>> common words. Anyway that's why I said cols and think that's right. At
>>> least, that is what RowSimilarityJob is doing.
>>> 
>>> On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]>
>>> wrote:
>>>> Rows.
>>>> 
>>>> On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
>>>> 
>>>>> Just needs a rule for
>>>>> tossing data -- you could simply throw away such columns (ouch), or at
>>>>> least use only a sampled subset of them.
>>>>> 
>>>> 
>>> 

--------------------------
Grant Ingersoll


