Randomizing interaction down-sampling is probably more important on the starting batch, since it is applied to an entire input row or column, and less important once a cut-off has already been reached. All new interactions (new items, for instance) would not have reached the cut-off anyway, which matters because one of the big reasons for going incremental is to quickly account for new items.
So I guess I agree that there is very little practical difference between incremental streaming and moving window streaming. There is a big difference in implementation and computation time, of course.

On Apr 23, 2015, at 5:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Removal is not as important as adding (which can be done). Also removal is often for business logic, like removal from a catalog, so a refresh may be driven by non-math considerations. Removal of users is only to clean up things, not required very often. Removal of items can happen from recs too, mitigating the issue.

The way the downsampling works now is to randomly remove interactions if we know there will be too many so that we end up with the right amount. The incremental approach would filter out all new interactions that are over the limit since the old interactions are not kept. This seems to violate the random choice of interactions to cut but now that I think about it does a random choice really matter?

On Apr 22, 2015, at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.

Actually, the method I was pushing is exact. If the sampling is made deterministic using clever seeds, then deletion is even possible since you can determine whether an observation was thrown away rather than used to increment counts.

The only creeping crud aspect of this is the accumulation of zero rows as things fall out of the accumulation window. I would be tempted to not allow deletion and just restart as Pat is suggesting.
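
For illustration only, here is a minimal Python sketch of the "deterministic sampling with seeds" idea, so that a later deletion can tell whether an observation was counted or thrown away. It is not the Mahout implementation. The names `keep` and `SampledCounts`, the `seed` string, and the fixed sampling `rate` are all hypothetical; the real down-sampling uses a per-row/column interaction limit rather than a fixed rate. The sketch also assumes each (user, item) pair is added at most once and removed only if it was added.

```python
import hashlib
from collections import Counter


def keep(user, item, rate, seed="seed-0"):
    """Deterministic keep/drop decision for one (user, item) interaction.

    The decision depends only on (seed, user, item), so it can be
    recomputed later without storing the dropped interactions.
    """
    digest = hashlib.sha256(f"{seed}|{user}|{item}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return value < int(rate * 2**64)            # integer compare avoids float rounding


class SampledCounts:
    """Per-item interaction counts over a deterministically sampled stream."""

    def __init__(self, rate, seed="seed-0"):
        self.rate = rate
        self.seed = seed
        self.counts = Counter()

    def add(self, user, item):
        """Count the interaction if the sampler keeps it; return the decision."""
        if keep(user, item, self.rate, self.seed):
            self.counts[item] += 1
            return True
        return False

    def remove(self, user, item):
        """Undo a previously added interaction.

        Assumes (user, item) was added exactly once before. If the sampler
        dropped it at add time, there is nothing to decrement.
        """
        if keep(user, item, self.rate, self.seed):
            self.counts[item] -= 1
            if self.counts[item] == 0:
                del self.counts[item]   # avoid accumulating zero entries
            return True
        return False
```

Because the decision is a pure function of the seed and the pair, `remove` decrements only the counts that `add` actually incremented. Deleting entries that reach zero is one way to keep the "zero rows" from piling up, though restarting as discussed above sidesteps the problem entirely.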