Randomizing interaction down-sampling is probably more important on the 
starting batch, since it is done on an entire input row or column, and not so 
important once a cut-off has already been reached. Interactions for anything 
new (new items, for instance) would not have reached the cut anyway, which is 
important since one of the big reasons for going incremental is to quickly 
account for new items.

So I guess I agree that there is very little practical difference between 
incremental streaming and moving window streaming. There is a big difference in 
implementation and computation time, of course.
 
On Apr 23, 2015, at 5:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Removal is not as important as adding (which can be done). Also, removal is 
often for business logic, like removal from a catalog, so a refresh may be 
driven by non-math considerations. Removal of users is only to clean things up 
and is not required very often. Items can also be removed from the recs 
themselves, which mitigates the issue.

The way the downsampling works now is to randomly remove interactions if we 
know there will be too many, so that we end up with the right amount. The 
incremental approach would filter out all new interactions that are over the 
limit, since the old interactions are not kept. This seems to violate the 
random choice of interactions to cut, but now that I think about it, does a 
random choice really matter?
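
To make the contrast above concrete, here is a minimal Python sketch of the two 
behaviours. It is not Mahout code: the function names, the per-row limit 
`max_per_row`, and the fixed seed are all assumptions made for illustration. 
The batch version sees the whole row and cuts at random; the incremental 
version only remembers how many interactions it has already kept, so whatever 
arrives after the limit is dropped in arrival order.

```python
import random


def downsample_batch(interactions, max_per_row, seed=0):
    """Batch style (hypothetical): the whole row is available, so keep a
    uniform random sample of at most max_per_row interactions."""
    if len(interactions) <= max_per_row:
        return list(interactions)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(list(interactions), max_per_row)


def downsample_incremental(kept_count, new_interactions, max_per_row):
    """Incremental style (hypothetical): old interactions are not stored,
    only how many were kept. New interactions are accepted in arrival
    order until the limit is reached; the rest are filtered out."""
    room = max(0, max_per_row - kept_count)
    accepted = list(new_interactions[:room])
    return accepted, kept_count + len(accepted)


if __name__ == "__main__":
    row = list(range(10))
    print(downsample_batch(row, 4))                       # 4 randomly chosen
    print(downsample_incremental(3, ["a", "b", "c"], 4))  # only "a" fits
```

A row for a new item starts with `kept_count == 0`, so its first interactions 
are always accepted, which is the case the incremental approach cares about 
most.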

On Apr 22, 2015, at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.


Actually, the method I was pushing is exact.  If the sampling is made
deterministic using clever seeds, then deletion is even possible since you
can determine whether an observation was thrown away rather than used to
increment counts.
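
One way to read the "clever seeds" idea is sketched below in Python. This is an 
illustration under stated assumptions, not the method as implemented anywhere: 
it uses a fixed sampling rate rather than a per-row cut, and the names 
`is_kept`, `add`, `delete`, `sample_rate`, and `salt` are all hypothetical. The 
point is only that when the keep/drop decision is a pure function of the 
observation, the same decision can be recomputed at deletion time, so a count 
is decremented only if that observation had incremented it.

```python
import hashlib
from collections import Counter


def is_kept(user, item, sample_rate, salt="seed-0"):
    """Deterministic keep/drop decision derived from a hash of the
    observation, so it can be recomputed later without storing it."""
    key = f"{salt}|{user}|{item}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Integer comparison avoids float rounding at the top of the range.
    return bucket < int(sample_rate * 2**64)


def add(counts, user, item, sample_rate):
    """Increment the item count only if the observation is sampled in."""
    if is_kept(user, item, sample_rate):
        counts[item] += 1


def delete(counts, user, item, sample_rate):
    """Undo an earlier add: decrement only if the observation had been
    sampled in, which the hash lets us determine after the fact."""
    if is_kept(user, item, sample_rate):
        counts[item] -= 1


if __name__ == "__main__":
    counts = Counter()
    events = [(f"u{i}", f"i{i % 5}") for i in range(200)]
    for user, item in events:
        add(counts, user, item, 0.5)
    print(sum(counts.values()), "of", len(events), "observations counted")
    for user, item in events:
        delete(counts, user, item, 0.5)
    print(dict(counts))  # every count is back to zero
```

Note that after the deletions the counter still holds entries whose value is 
zero unless they are pruned explicitly.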

The only creeping crud aspect of this is the accumulation of zero rows as
things fall out of the accumulation window.  I would be tempted to not
allow deletion and just restart as Pat is suggesting.

