FYI: related paper by Lin http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.156.8326&rep=rep1&type=pdf
Nothing too different from the original, but it goes into a bit more detail and has more comparisons.

On Jul 14, 2011, at 6:26 PM, Sean Owen wrote:

> What's a row here, a user? I completely agree, but then this describes how
> you start item-item similarity computation, where items are columns, right?
> The job here is turned on its side, computing row similarity.
>
> On Jul 14, 2011 11:21 PM, "Ted Dunning" <[email protected]> wrote:
>
>> The problem arises when the program is reading a single row and emitting all
>> of the cooccurring items. The number of items emitted is the square of the
>> number of items in a row. Thus, it is the denser rows that cause the
>> problem.
>>
>> On Thu, Jul 14, 2011 at 2:25 PM, Sean Owen <[email protected]> wrote:
>>
>>> In their example, docs were rows and words were columns. The terms of
>>> the inner products they computed came from processing the posting
>>> lists / columns instead of rows, and emitting all pairs of docs
>>> containing a word. Sounds like they just tossed the posting lists for
>>> common words. Anyway, that's why I said columns, and I think that's right.
>>> At least, that is what RowSimilarityJob is doing.
>>>
>>> On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Rows.
>>>>
>>>> On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
>>>>
>>>>> Just needs a rule for tossing data -- you could simply throw away such
>>>>> columns (ouch), or at least use only a sampled subset of them.

--------------------------
Grant Ingersoll
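For anyone skimming the thread, a minimal sketch (plain Java, not the actual Mahout or RowSimilarityJob code) of the pattern being discussed: emitting every cooccurring pair from a single row is quadratic in the row's length, and one possible "rule for tossing data" is to sample overly dense rows down before emitting. MAX_ITEMS_PER_ROW is a made-up cap for illustration, not a real Mahout parameter.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: why dense rows blow up pair emission, and one possible sampling rule.
public class CooccurrenceSketch {

  static final int MAX_ITEMS_PER_ROW = 500;   // hypothetical cap, for illustration
  static final Random RANDOM = new Random(42);

  // k items in one row produce k*(k-1)/2 pairs, so output volume is dominated
  // by the densest rows.
  static List<long[]> emitPairs(List<Long> rowItems) {
    List<long[]> pairs = new ArrayList<>();
    for (int i = 0; i < rowItems.size(); i++) {
      for (int j = i + 1; j < rowItems.size(); j++) {
        pairs.add(new long[] {rowItems.get(i), rowItems.get(j)});
      }
    }
    return pairs;
  }

  // One possible pruning rule: keep only a random sample of a too-dense row
  // before emitting its pairs.
  static List<Long> sampleRow(List<Long> rowItems) {
    if (rowItems.size() <= MAX_ITEMS_PER_ROW) {
      return rowItems;
    }
    List<Long> copy = new ArrayList<>(rowItems);
    Collections.shuffle(copy, RANDOM);
    return copy.subList(0, MAX_ITEMS_PER_ROW);
  }
}

A 10,000-item row would otherwise emit roughly 50 million pairs; capping it at 500 items keeps that under about 125,000 pairs per row, at the cost of an approximation on exactly the dense rows the thread is worried about.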
