Currently maxPrefs is applied to the input, both per row and per column (in the 
Hadoop and Scala versions), and has a default of 500. maxSimilaritiesPerItem 
applies to the cooccurrence matrix and is applied to its rows; the default is 50. 
Similar down-sampling is done for row similarity.
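
As a concrete illustration of the count-based down-sampling, here is a minimal 
sketch (not the actual Mahout code: the helper name, the (user, item) pair input, 
and the random-sampling strategy are all assumptions):

import org.apache.spark.rdd.RDD
import scala.util.Random

// Hypothetical helper: keep at most maxPrefs interactions per user (the same
// idea applies per item by swapping the key). Over-active users are randomly
// down-sampled; everyone else passes through untouched.
def downsampleByUser(interactions: RDD[(String, String)],
                     maxPrefs: Int = 500): RDD[(String, String)] = {
  interactions
    .groupByKey()
    .flatMap { case (user, items) =>
      val itemSeq = items.toSeq
      val kept =
        if (itemSeq.size <= maxPrefs) itemSeq
        else Random.shuffle(itemSeq).take(maxPrefs)
      kept.map(item => (user, item))
    }
}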

For a new way to use a threshold I was thinking of one that is relative to the 
data itself, so it would always produce the same number of items in the input 
but based only on a quality threshold, not on row and column counts. From 
Sebastian’s paper this may not produce much benefit, and the downside is that 
the input distribution parameters must be calculated before sparsification. 
This is avoided with a fixed threshold and/or row and column count downsampling.
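
To sketch what I mean (an assumption-laden example, not an implementation: it 
keeps similarities above mean + k standard deviations and shows where the extra 
pass to compute the distribution parameters lands):

import org.apache.spark.rdd.RDD

// Hypothetical sketch of a data-relative threshold: keep only similarities
// above mean + k*sigma of the observed score distribution. Note the full
// pass to compute the statistics before any sparsification can happen.
def relativeThreshold(sims: RDD[((String, String), Double)],
                      sigmas: Double = 2.0): RDD[((String, String), Double)] = {
  val stats = sims.values.stats()   // count, mean, stdev in one pass
  val cut = stats.mean + sigmas * stats.stdev
  sims.filter { case (_, score) => score >= cut }
}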

BTW there is another half-way method to do part of this by juggling DStreams 
and RDDs. Trade-offs apply of course. 

The idea would be to make Cooccurrence a streaming operation fed by an update 
period of micro-batches. Keeping the input as a DStream allows us to drop old 
data when new nano-batches come in, but the entire time window is potentially 
large, maybe months for long-lived items. The time window would be fed to 
Cooccurrence periodically.

The benefit is that the process never reads persisted data (a fairly 
time-consuming operation with nano-batches) but is passed new RDDs that have 
come from some streaming input (Kafka?).
The downside is that it still needs the entire time window’s worth of data for 
the calculation. In Spark terms the input is taken from a DStream.
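
Roughly, the wiring might look like this in Spark Streaming terms (a sketch 
under assumptions: a socket source stands in for Kafka, the durations are 
placeholders, and computeCooccurrence is a stub for the existing batch math run 
over the windowed RDD):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object StreamingCooccurrenceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingCooccurrence")
    val ssc = new StreamingContext(conf, Seconds(60))        // micro-batch interval

    // Stand-in for a Kafka source: lines of "userId<tab>itemId"
    val interactions = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(user, item) = line.split("\t")
      (user, item)
    }

    // Keep a long sliding window of raw interactions and re-run the existing
    // (non-incremental) cooccurrence math once per update period.
    val window = interactions.window(Minutes(60L * 24 * 30), // time window, e.g. 30 days
                                     Minutes(60))            // update period
    window.foreachRDD { rdd =>
      // The whole window's worth of data is needed here, which is the downside
      // noted above. computeCooccurrence is a placeholder for the batch code:
      // val cooccurrenceMatrix = computeCooccurrence(rdd)
      println(s"recomputing over ${rdd.count()} interactions in the window")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}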

I think we have been talking about an idea that does an incremental 
approximation, then a refresh every so often to remove the approximation error, 
so in an ideal world we need both.

Streaming but non-incremental would be relatively easy and would use the current 
math code. Incremental would require in-memory data structures of custom design.
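
For the incremental half, the shape might be something like the following 
(purely illustrative: the class, its pairwise-count structure, and the refresh 
hook are my assumptions about a custom in-memory design, not existing code):

import scala.collection.mutable

// Hypothetical in-memory accumulator for the incremental approximation.
// Each micro-batch bumps pairwise counts; refresh() replaces the whole
// structure with the result of the exact batch recomputation.
class IncrementalCooccurrence {
  private val pairCounts =
    mutable.Map.empty[(String, String), Long].withDefaultValue(0L)

  // Fold one micro-batch of (user, items-in-batch) into the running counts.
  def update(batch: Seq[(String, Seq[String])]): Unit =
    for ((_, items) <- batch; a <- items; b <- items if a < b)
      pairCounts((a, b)) += 1

  // Periodic refresh: drop the approximation and reload exact counts
  // produced by the existing batch math.
  def refresh(exact: Map[(String, String), Long]): Unit = {
    pairCounts.clear()
    pairCounts ++= exact
  }

  def count(a: String, b: String): Long =
    if (a < b) pairCounts((a, b)) else pairCounts((b, a))
}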


On Apr 19, 2015, at 8:39 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Inline

On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Short answer, you are correct this is not a new filter.
> 
> The Hadoop MapReduce version implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
> 
> Scala version:
> * maxSimilaritiesPerItem
> 

I think of this as "column-wise", but that may be bad terminology.


> * maxPrefs
> 

And I think of this as "row-wise" or "user limit".  I think it is the
interaction-cut from the paper.


> 
> The paper talks about an interaction-cut, and describes it with “There is
> no significant decrease in the error for incorporating more interactions
> from the ‘power users’ after that.” While I’d trust your reading better
> than mine, I thought that meant downsampling overactive users.
> 

I agree.



> 
> However, both the Hadoop MapReduce and the Scala version downsample both
> user and item interactions by maxPrefs. So you are correct, not a new thing.
> 
> The paper also talks about the threshold and we’ve talked on the list
> about how better to implement that. A fixed number is not very useful so a
> number of sigmas was proposed but is not yet implemented.
> 

I think that both  minPrefsPerUser and threshold have limited utility in
the current code.  Could be wrong about that.

With low-quality association measures that suffer from low count problems
or simplistic user-based methods, minPrefsPerUser can be crucial.
Threshold can also be required for systems like that.

The Scala code doesn't have that problem since it doesn't support those
metrics.
