This indicates that some rows of your data are dense, and the total time for
the job is dominated by those rows: a row with k non-zero entries generates
on the order of k^2 cooccurrences, so a handful of dense rows can account
for almost all of the work.
The practical answer is to down-sample those rows to a maximum size. This
typically has no practical impact, since you learn almost nothing new after a
thousand entries or so. For some distance metrics there is a practical *and*
correct answer available via SVD, but even the minimum fuss there is
excessive.
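Here is a rough sketch of what I mean, assuming the Mahout Vector API. The
RowDownSampler class and the maxEntries parameter are just illustrative names,
not anything that exists in Mahout:

import java.util.Iterator;
import java.util.Random;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Illustrative helper, not part of Mahout: caps the number of non-zero
// entries in a row before it goes into RowSimilarityJob.
public final class RowDownSampler {

  private static final Random RND = new Random();

  public static Vector downSample(Vector row, int maxEntries) {
    int n = row.getNumNondefaultElements();
    if (n <= maxEntries) {
      return row;                    // already sparse enough, leave it alone
    }
    // Copy indices and values out first, since iterateNonZero() may reuse
    // its Element instance across iterations.
    int[] indices = new int[n];
    double[] values = new double[n];
    int i = 0;
    for (Iterator<Vector.Element> it = row.iterateNonZero(); it.hasNext(); i++) {
      Vector.Element e = it.next();
      indices[i] = e.index();
      values[i] = e.get();
    }
    // Partial Fisher-Yates shuffle: after this, the first maxEntries slots
    // hold a uniform random subset of the original entries.
    for (int k = 0; k < maxEntries; k++) {
      int j = k + RND.nextInt(n - k);
      int ti = indices[k]; indices[k] = indices[j]; indices[j] = ti;
      double tv = values[k]; values[k] = values[j]; values[j] = tv;
    }
    Vector result = new RandomAccessSparseVector(row.size(), maxEntries);
    for (int k = 0; k < maxEntries; k++) {
      result.setQuick(indices[k], values[k]);
    }
    return result;
  }
}

You would run each row through something like that as a preprocessing pass
before RowSimilarityJob ever sees it.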
On Thu, Jul 14, 2011 at 10:17 AM, Grant Ingersoll <[email protected]> wrote:

> > Could you give some numbers about the size of your input matrix and the
> > value of the counter COOCCURRENCES from RowSimilarityJob?
>
> Last I looked it was around 53B before it was killed.