This indicates that some rows of your data are dense.  Total time for the
job is dominated by those dense rows, since a row with k nonzero entries
generates on the order of k^2 cooccurrence pairs.

The practical answer is to down-sample these rows to a maximum size.  This
typically has no practical impact since you learn almost nothing new after
a thousand entries or so.
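
For concreteness, here is a minimal sketch of the kind of down-sampling I
mean (plain Java; the class and method names are made up for illustration,
and the row is assumed to be held as an array of nonzero column indices --
for occurrence data the values are all ones, so sampling the indices is
enough):

    import java.util.Random;

    public class RowDownSampler {
      // Keep at most maxEntries nonzero entries of a sparse row,
      // chosen uniformly at random (standard reservoir sampling).
      public static int[] sampleIndices(int[] indices, int maxEntries, Random rng) {
        if (indices.length <= maxEntries) {
          return indices;
        }
        int[] keep = new int[maxEntries];
        // Fill the reservoir with the first maxEntries indices ...
        System.arraycopy(indices, 0, keep, 0, maxEntries);
        // ... then give every later index a fair chance of displacing one.
        for (int i = maxEntries; i < indices.length; i++) {
          int j = rng.nextInt(i + 1);
          if (j < maxEntries) {
            keep[j] = indices[i];
          }
        }
        return keep;
      }
    }

With maxEntries around 1000, a row with 100,000 nonzeros shrinks 100x and
its cooccurrence cost drops by roughly 10,000x, while the similarity
estimates barely move.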

For some distance metrics there is a practical *and* correct answer
available via SVD, but even the minimum fuss involved there is excessive.

On Thu, Jul 14, 2011 at 10:17 AM, Grant Ingersoll <[email protected]> wrote:

> > Could you give some numbers about the size of your input matrix and the
> > value of the counter COOCCURRENCES from RowSimilarityJob?
>
> Last I looked it was around 53B before it was killed.
