In their example, docs were rows and words were columns. The terms of the inner products they computed came from processing the posting lists (the columns) rather than the rows, emitting all pairs of docs that contain a given word. It sounds like they simply tossed the posting lists for common words. Anyway, that's why I said columns, and I think that's right. At least, that is what RowSimilarityJob is doing.
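To make the column-wise idea concrete, here is a rough sketch in plain Python. It is not Mahout code and does not reproduce RowSimilarityJob; the function name, the max_postings cutoff and the sample flag are all made up for illustration. It inverts the doc rows into posting lists, emits every pair of docs sharing a word, and either drops or samples any posting list longer than the cutoff.

```python
import random
from collections import defaultdict
from itertools import combinations


def pairwise_dot_products(docs, max_postings=None, sample=False, seed=0):
    """Accumulate doc-doc inner products by walking posting lists (columns).

    docs: dict mapping doc_id -> {word: weight} (rows of the matrix).
    max_postings: hypothetical cutoff; longer posting lists are either
        tossed (sample=False) or randomly cut down to this size (sample=True).
    Returns a dict mapping (doc_a, doc_b) with doc_a < doc_b -> dot product.
    """
    # Invert rows into columns: word -> list of (doc_id, weight).
    postings = defaultdict(list)
    for doc_id, words in docs.items():
        for word, weight in words.items():
            postings[word].append((doc_id, weight))

    rng = random.Random(seed)
    dots = defaultdict(float)
    for word, plist in postings.items():
        if max_postings is not None and len(plist) > max_postings:
            if not sample:
                continue  # toss the whole column for a very common word
            plist = rng.sample(plist, max_postings)  # keep a random subset
        # Every pair of docs containing this word contributes one term.
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            dots[(a, b)] += wa * wb
    return dict(dots)


docs = {
    "d1": {"the": 1.0, "cat": 2.0},
    "d2": {"the": 1.0, "cat": 1.0, "dog": 3.0},
    "d3": {"the": 1.0, "dog": 1.0},
}
full = pairwise_dot_products(docs)
capped = pairwise_dot_products(docs, max_postings=2)
```

With the cutoff at 2, the column for "the" (three docs) is dropped, so d1 and d3 no longer share anything and the pair disappears. A posting list of length n emits n(n-1)/2 pairs, which is why the common words dominate the cost. Sampling keeps some of the signal, but the resulting dot products are underestimates unless they are rescaled.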
On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]> wrote:

> Rows.
>
> On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
>
>> Just needs a rule for tossing data -- you could simply throw away such
>> columns (ouch), or at least use only a sampled subset of it.
>>
