The problem arises when the program is reading a single row and emitting all
of the co-occurring items.  The number of pairs emitted is roughly the square
of the number of items in the row, so it is the denser rows that cause the
problem.
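
To make the blow-up concrete, here is a rough sketch of the pair emission for
one row (not Mahout's actual mapper; PairEmitter and emitPairs are made-up
names, and a real Hadoop mapper would write through its context instead):

import java.util.List;

// Rough sketch: emitting every co-occurring pair from one row produces on
// the order of n^2 outputs for a row with n items, which is why dense rows
// dominate the cost.
public class CooccurrencePairSketch {

  // Hypothetical emit target; in a real mapper this would be
  // context.write(...) with suitable key/value types.
  interface PairEmitter {
    void emit(long itemA, long itemB);
  }

  static void emitPairs(List<Long> rowItems, PairEmitter out) {
    int n = rowItems.size();
    // Each of the n items is paired with every other item in the row,
    // so roughly n * (n - 1) pairs are emitted.
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        if (i != j) {
          out.emit(rowItems.get(i), rowItems.get(j));
        }
      }
    }
  }
}

A row of 10 items emits about 90 pairs, but a row of 10,000 items emits
roughly 100 million, which is where the dense rows hurt.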

On Thu, Jul 14, 2011 at 2:25 PM, Sean Owen <[email protected]> wrote:

> In their example, docs were rows and words were columns. The terms of
> the inner products they computed came from processing the posting
> lists / columns instead of rows and emitting all pairs of docs
> containing a word. Sounds like they just tossed the posting list for
> common words. Anyway that's why I said cols and think that's right. At
> least, that is what RowSimilarityJob is doing.
>
> On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]>
> wrote:
> > Rows.
> >
> > On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
> >
> >> Just needs a rule for
> >> tossing data -- you could simply throw away such columns (ouch), or at
> >> least
> >> use only a sampled subset of it.
> >>
> >
>
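
The sampling rule suggested above amounts to capping the contribution of any
one row or posting list.  A rough sketch of such a rule, with made-up names
that are not Mahout's API:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Rough sketch of a down-sampling rule: cap the number of items taken from
// any one row or posting list so the pair count stays bounded at roughly
// maxItems^2.
public class RowSamplingSketch {

  static List<Long> sampleIfTooDense(List<Long> items, int maxItems, Random rng) {
    if (items.size() <= maxItems) {
      return items;
    }
    // Shuffle a copy and keep only the first maxItems entries; the blunter
    // alternative mentioned above is to throw the whole row/column away.
    List<Long> copy = new ArrayList<>(items);
    Collections.shuffle(copy, rng);
    return copy.subList(0, maxItems);
  }
}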
