Inline

On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Short answer, you are correct this is not a new filter.
>
> The Hadoop MapReduce implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
>
> Scala version:
> * maxSimilaritiesPerItem
> I think of this as "column-wise", but that may be bad terminology.
> * maxPrefs
> And I think of this as "row-wise" or "user limit". I think it is the
> interaction-cut from the paper.
>
> The paper talks about an interaction-cut, and describes it with "There is
> no significant decrease in the error for incorporating more interactions
> from the 'power users' after that." While I'd trust your reading better
> than mine, I thought that meant downsampling overactive users.

I agree.

> However, both the Hadoop MapReduce and the Scala version downsample both
> user and item interactions by maxPrefs. So you are correct, not a new thing.
>
> The paper also talks about the threshold and we've talked on the list
> about how better to implement that. A fixed number is not very useful, so a
> number of sigmas was proposed but is not yet implemented.

I think that both minPrefsPerUser and threshold have limited utility in the
current code. Could be wrong about that. With low-quality association
measures that suffer from low-count problems, or with simplistic user-based
methods, minPrefsPerUser can be crucial. Threshold can also be required for
systems like that. The Scala code doesn't have that problem since it doesn't
support those metrics.
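To make the two ideas above concrete, here is a minimal Scala sketch, not the actual Mahout implementation: a maxPrefs-style downsample that caps each user's interactions (the interaction-cut on "power users"), and a sigma-based threshold that keeps only similarity scores more than k standard deviations above the mean (the proposed but not-yet-implemented alternative to a fixed cutoff). All names here (downsample, sigmaThreshold, maxPrefs) are illustrative, not Mahout's API.

```scala
import scala.util.Random

// Illustrative sketch only; not Mahout's actual downsampling or threshold code.
object FilterSketch {
  val rng = new Random(42) // fixed seed so the sampling is reproducible

  // Cap each user's interaction list at maxPrefs by keeping a random sample,
  // so "power users" don't dominate the cooccurrence counts.
  def downsample(interactions: Map[String, Seq[String]],
                 maxPrefs: Int): Map[String, Seq[String]] =
    interactions.map { case (user, items) =>
      if (items.size <= maxPrefs) user -> items
      else user -> rng.shuffle(items).take(maxPrefs)
    }

  // Keep only scores more than `sigmas` standard deviations above the mean,
  // instead of comparing against a fixed absolute threshold.
  def sigmaThreshold(scores: Seq[Double], sigmas: Double): Seq[Double] = {
    val mean = scores.sum / scores.size
    val std = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / scores.size)
    scores.filter(_ >= mean + sigmas * std)
  }

  def main(args: Array[String]): Unit = {
    val data = Map(
      "powerUser"  -> (1 to 10).map(i => s"item$i"),
      "casualUser" -> Seq("item1", "item2")
    )
    val sampled = downsample(data, maxPrefs = 5)
    println(sampled("powerUser").size)  // 5: capped at maxPrefs
    println(sampled("casualUser").size) // 2: untouched, already under the cap
  }
}
```

The point of the sigma form is that it adapts to the score distribution of each dataset, whereas a fixed number has to be re-tuned per similarity metric.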