FYI: related paper by Lin 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.156.8326&rep=rep1&type=pdf

Nothing too different from the original, but it goes into a bit more detail and has
more comparisons.

On Jul 14, 2011, at 6:26 PM, Sean Owen wrote:

> What's a row here, a user? I completely agree but then this describes how
> you start item-item similarity computation, where items are columns, right?
> The job here is turned on its side, computing row similarity.
> On Jul 14, 2011 11:21 PM, "Ted Dunning" <[email protected]> wrote:
>> The problem arises when the program is reading a single row and emitting
>> all
>> of the cooccurring items. The number of items emitted is the square of the
>> number of items in a row. Thus, it is more dense rows that cause the
>> problem.
>> 
>> On Thu, Jul 14, 2011 at 2:25 PM, Sean Owen <[email protected]> wrote:
>> 
>>> In their example, docs were rows and words were columns. The terms of
>>> the inner products they computed came from processing the posting
>>> lists / columns instead of rows and emitting all pairs of docs
>>> containing a word. Sounds like they just tossed the posting list for
>>> common words. Anyway, that's why I said cols and I think that's right. At
>>> least, that is what RowSimilarityJob is doing.
>>> 
>>> On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]>
>>> wrote:
>>>> Rows.
>>>> 
>>>> On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
>>>> 
>>>>> Just needs a rule for
>>>>> tossing data -- you could simply throw away such columns (ouch), or at
>>>>> least
>>>>> use only a sampled subset of them.
>>>>> 
>>>> 
>>> 
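For illustration, here is a rough sketch (not the actual Mahout RowSimilarityJob
code; the class and method names are made up) of the pair emission Ted describes
above. For each row -- a user's items, or equivalently a word's posting list in
Lin's setup -- every co-occurring pair is emitted, so the output per row grows
with the square of the row's length:

    import java.util.Arrays;
    import java.util.List;

    // Illustrative only: emit every co-occurring pair from one row.
    // With n items in the row this emits n*(n-1)/2 pairs, which is why
    // dense rows blow up the job.
    public class CooccurrenceSketch {

      static void emitPairs(List<String> itemsInRow) {
        for (int i = 0; i < itemsInRow.size(); i++) {
          for (int j = i + 1; j < itemsInRow.size(); j++) {
            // In the real job these pairs would go to a reducer;
            // printing is just to show the volume.
            System.out.println(itemsInRow.get(i) + "\t" + itemsInRow.get(j));
          }
        }
      }

      public static void main(String[] args) {
        // A row with 5 items already yields 10 pairs; a row with 10,000
        // items would yield roughly 50 million.
        emitPairs(Arrays.asList("a", "b", "c", "d", "e"));
      }
    }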

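And a sketch of the kind of "rule for tossing data" Sean suggests: if a row (or
posting list) is longer than some cap, keep only a random sample of it before
emitting pairs. The cap of 500 and the names here are arbitrary, not Mahout
defaults:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Illustrative only: bound the pair-emission cost of over-dense rows
    // by sampling them down to a fixed maximum size.
    public class DownsampleSketch {

      static final int MAX_ITEMS_PER_ROW = 500; // arbitrary cap for the sketch
      static final Random RANDOM = new Random();

      static List<String> maybeSample(List<String> itemsInRow) {
        if (itemsInRow.size() <= MAX_ITEMS_PER_ROW) {
          return itemsInRow;
        }
        List<String> copy = new ArrayList<>(itemsInRow);
        Collections.shuffle(copy, RANDOM);
        // Pair count per row is now bounded by MAX_ITEMS_PER_ROW^2 / 2,
        // no matter how dense the original row was.
        return copy.subList(0, MAX_ITEMS_PER_ROW);
      }

      public static void main(String[] args) {
        List<String> dense = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
          dense.add("item" + i);
        }
        System.out.println(maybeSample(dense).size()); // prints 500
      }
    }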
--------------------------
Grant Ingersoll


