Removal is not as important as adding (which can be done). Removal is also often driven by business logic, like removing an item from a catalog, so a refresh may be triggered by non-math considerations. Removal of users is only needed to clean things up and is not required very often. Items can also be removed from the recs themselves, which mitigates the issue.
The way the downsampling works now is to randomly remove interactions when we know there will be too many, so that we end up with the right amount. The incremental approach would filter out all new interactions over the limit, since the old interactions are not kept. This seems to violate the random choice of which interactions to cut, but now that I think about it, does a random choice really matter?

On Apr 22, 2015, at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation, so
> in an ideal world we need both.

Actually, the method I was pushing is exact. If the sampling is made deterministic using clever seeds, then deletion is even possible, since you can determine whether an observation was thrown away rather than used to increment counts. The only creeping-crud aspect of this is the accumulation of zero rows as things fall out of the accumulation window. I would be tempted to not allow deletion and just restart, as Pat is suggesting.
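For concreteness, here is a minimal Scala sketch of the deterministic-seed idea Ted describes (not actual Mahout code, and not the current per-user cap logic): the keep/drop decision is a pure function of the interaction and a seed, so it can be recomputed at any later time to tell whether a deleted observation ever contributed to the counts. The names DeterministicSampler, keep, and keepFraction are made up for illustration.

    import scala.util.hashing.MurmurHash3

    object DeterministicSampler {
      // Decide whether an interaction is kept. The decision depends only on
      // (user, item, seed), so the same answer can be recomputed later.
      // keepFraction is a hypothetical knob for the target sampling rate.
      def keep(user: String, item: String, seed: Int, keepFraction: Double): Boolean = {
        val h = MurmurHash3.stringHash(user + "\u0000" + item, seed)
        // Map the 32-bit hash into [0, 1] and compare with the keep fraction.
        val u = (h & 0x7fffffff).toDouble / Int.MaxValue.toDouble
        u < keepFraction
      }
    }

    // Adding: only increment co-occurrence counts when keep(...) is true.
    // Deleting: recompute keep(...) with the same seed; if it returns false,
    // the observation never contributed and no counts need to change.

Under this assumption, deletion reduces to re-running the same test: counts are decremented only for interactions that were actually kept, which is what makes the incremental method exact rather than approximate. The zero-rows issue Ted mentions is unaffected by this sketch.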