Currently maxPrefs is applied to the input, both row-wise and column-wise (in both the Hadoop and Scala versions), and has a default of 500. maxSimilaritiesPerItem applies to the rows of the cooccurrence matrix and has a default of 50. Similar down-sampling is done for row similarity.
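To make that concrete, here is roughly what the count-based down-sampling amounts to. This is a sketch only, not the actual Mahout code — the (user, item) pair representation, the random sampling, and the names are illustrative:

import scala.util.Random

import org.apache.spark.rdd.RDD

// Sketch only -- not the actual Mahout code. Down-sample (user, item)
// interactions so that no user (row) and no item (column) contributes
// more than maxPrefs entries; the real implementations differ in the
// sampling details and in the matrix representation.
def downSample(interactions: RDD[(String, String)], maxPrefs: Int = 500): RDD[(String, String)] = {

  // row-wise cut: keep at most maxPrefs interactions per user
  val rowCut = interactions.groupByKey().flatMap { case (user, items) =>
    val kept =
      if (items.size > maxPrefs) Random.shuffle(items.toSeq).take(maxPrefs)
      else items.toSeq
    kept.map(item => (user, item))
  }

  // column-wise cut: keep at most maxPrefs interactions per item
  rowCut.map(_.swap).groupByKey().flatMap { case (item, users) =>
    val kept =
      if (users.size > maxPrefs) Random.shuffle(users.toSeq).take(maxPrefs)
      else users.toSeq
    kept.map(user => (user, item))
  }
}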
For a new way to use threshold I was thinking of one that is relative to the data itself and so would always produce the same number of items from the input, but based only on a quality threshold, not on row and column counts. From Sebastian’s paper this may not produce much benefit, and the downside is that the input distribution parameters must be calculated before sparsification. This is avoided with a fixed threshold and/or row and column count downsampling.

BTW there is another half-way method to do part of this by juggling DStreams and RDDs. Trade-offs apply, of course. The idea would be to make Cooccurrence a streaming operation fed by an update period of micro-batches. Keeping the input as a DStream allows us to drop old data when new nano-batches come in, but the entire time window is potentially large, maybe months for long-lived items. The time window would be fed to Cooccurrence periodically. The benefit is that the process never reads persisted data (a fairly time-consuming operation with nano-batches) but is passed new RDDs that have come from some streaming input (Kafka?). The downside is that it still needs the entire time window’s worth of data for the calculation. In Spark terms the input is taken from a DStream.

I think we have been talking about an idea that does an incremental approximation, then a refresh every so often to remove any approximation, so in an ideal world we need both. Streaming but non-incremental would be relatively easy and would use the current math code. Incremental would require in-memory data structures of custom design. Rough sketches of both the relative threshold and the windowed DStream version are appended below the quoted thread.

On Apr 19, 2015, at 8:39 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Inline

On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Short answer, you are correct this is not a new filter.
>
> The Hadoop MapReduce implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
>
> Scala version:
> * maxSimilaritiesPerItem

I think of this as "column-wise", but that may be bad terminology.

> * maxPrefs

And I think of this as "row-wise" or "user limit". I think it is the interaction-cut from the paper.

> The paper talks about an interaction-cut, and describes it with "There is
> no significant decrease in the error for incorporating more interactions
> from the ‘power users’ after that.” While I’d trust your reading better
> than mine I thought that meant downsampling overactive users.

I agree.

> However both the Hadoop MapReduce and the Scala version downsample both
> user and item interactions by maxPrefs. So you are correct, not a new thing.
>
> The paper also talks about the threshold and we’ve talked on the list
> about how better to implement that. A fixed number is not very useful so a
> number of sigmas was proposed but is not yet implemented.

I think that both minPrefsPerUser and threshold have limited utility in the current code. Could be wrong about that.

With low quality association measures that suffer from low count problems or simplistic user-based methods, minPrefsPerUser can be crucial. Threshold can also be required for systems like that. The Scala code doesn't have that problem since it doesn't support those metrics.
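Here is the kind of thing I mean by a relative (sigma-based) threshold. A sketch only: it assumes the similarity scores (e.g. LLR) are already computed, and thresholdBySigma and the sigmas knob are made-up names:

import org.apache.spark.rdd.RDD

// Sketch of a data-relative threshold: keep only entries whose similarity
// score is more than `sigmas` standard deviations above the mean of all
// scores. Note the distribution parameters have to be computed over the
// whole matrix before sparsification -- the cost mentioned above.
def thresholdBySigma(scores: RDD[((String, String), Double)],
                     sigmas: Double = 2.0): RDD[((String, String), Double)] = {
  val stats = scores.values.stats() // mean and stdev of all scores
  val cutoff = stats.mean + sigmas * stats.stdev
  scores.filter { case (_, score) => score >= cutoff }
}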
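And the windowed DStream idea, streaming but non-incremental. Again a sketch: computeCooccurrence stands in for the existing math code, and the window and update-period values are purely illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, Minutes}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical hook into the existing (batch, non-incremental) math code.
def computeCooccurrence(interactions: RDD[(String, String)]): Unit = {
  // run the current cooccurrence calculation over the whole window and
  // publish the resulting indicator matrix
}

// Sketch only: `interactions` is a DStream of (user, item) pairs coming
// from some streaming source (Kafka?). Old micro-batches fall out of the
// window automatically; every update period the full window is handed to
// the math code, so nothing is re-read from persistent storage. Both
// durations must be multiples of the streaming batch interval.
def streamingCooccurrence(interactions: DStream[(String, String)],
                          window: Duration = Minutes(60 * 24 * 30), // illustrative: ~30 days
                          updatePeriod: Duration = Minutes(60)): Unit = {
  interactions
    .window(window, updatePeriod)
    .foreachRDD { rdd => computeCooccurrence(rdd) }
}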