The problem arises when the program reads a single row and emits all of the cooccurring items. The number of pairs emitted is the square of the number of items in the row, so it is the denser rows that cause the problem.
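To make the blow-up concrete, here is a rough, self-contained sketch (plain Java, not the actual Mahout mapper; the class and method names are made up for illustration) of what emitting all cooccurring items for one row amounts to:

import java.util.ArrayList;
import java.util.List;

public class CooccurrencePairs {

  // Emits every ordered pair of distinct items appearing in one row.
  // For a row with n items this produces n * (n - 1) pairs, i.e. the
  // output grows quadratically with row density.
  static List<int[]> emitPairs(int[] rowItems) {
    List<int[]> pairs = new ArrayList<>();
    for (int i = 0; i < rowItems.length; i++) {
      for (int j = 0; j < rowItems.length; j++) {
        if (i != j) {
          pairs.add(new int[] {rowItems[i], rowItems[j]});
        }
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    int[] sparseRow = {3, 17, 42};       // 3 items   -> 6 pairs
    int[] denseRow = new int[1000];      // 1000 items -> 999,000 pairs
    for (int i = 0; i < denseRow.length; i++) {
      denseRow[i] = i;
    }
    System.out.println("sparse row emits " + emitPairs(sparseRow).size() + " pairs");
    System.out.println("dense row emits  " + emitPairs(denseRow).size() + " pairs");
  }
}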
On Thu, Jul 14, 2011 at 2:25 PM, Sean Owen <[email protected]> wrote:
> In their example, docs were rows and words were columns. The terms of
> the inner products they computed came from processing the posting
> lists / columns instead of rows and emitting all pairs of docs
> containing a word. Sounds like they just tossed the posting list for
> common words. Anyway that's why I said cols and think that's right. At
> least, that is what RowSimilarityJob is doing.
>
> On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]> wrote:
> > Rows.
> >
> > On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
> >
> >> Just needs a rule for tossing data -- you could simply throw away
> >> such columns (ouch), or at least use only a sampled subset of it.
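On the "throw away / sample a subset" suggestion quoted above: something along these lines would bound the per-row pair count. The cap value, the reservoir-sampling choice, and the class/method names are my own illustration, not what RowSimilarityJob actually does:

import java.util.Random;

public class RowDownsampler {

  // Returns at most maxItemsPerRow items from the row, chosen uniformly
  // at random via reservoir sampling, so the pair count per row is bounded
  // by maxItemsPerRow^2 instead of growing with the full row length.
  static int[] sampleRow(int[] rowItems, int maxItemsPerRow, Random random) {
    if (rowItems.length <= maxItemsPerRow) {
      return rowItems;
    }
    int[] sample = new int[maxItemsPerRow];
    for (int i = 0; i < rowItems.length; i++) {
      if (i < maxItemsPerRow) {
        sample[i] = rowItems[i];
      } else {
        int j = random.nextInt(i + 1);
        if (j < maxItemsPerRow) {
          sample[j] = rowItems[i];
        }
      }
    }
    return sample;
  }

  public static void main(String[] args) {
    int[] denseRow = new int[10000];
    for (int i = 0; i < denseRow.length; i++) {
      denseRow[i] = i;
    }
    // 500^2 = 250,000 pairs instead of ~100,000,000 for the full row.
    int[] capped = sampleRow(denseRow, 500, new Random(42));
    System.out.println("kept " + capped.length + " of " + denseRow.length + " items");
  }
}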
