Removal is not as important as adding (which can be done). Removal is also 
often driven by business logic, like removing an item from a catalog, so a 
refresh may be triggered by non-math considerations anyway. Removal of users is 
only needed for cleanup and isn't required very often. Items can also be removed 
from the recs themselves, which mitigates the issue.

The way the downsampling works now is to randomly remove interactions when we 
know there will be too many, so that we end up with the right amount. The 
incremental approach would filter out all new interactions that arrive over the 
limit, since the old interactions are not kept. That seems to violate the random 
choice of which interactions to cut, but now that I think about it, does a 
random choice really matter? A rough sketch of the difference is below.
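
A minimal sketch of the contrast, not taken from the actual Mahout code; the 
cap name and value are assumptions, roughly analogous to a maxPrefsPerUser 
style limit:

```scala
import scala.util.Random

object DownsampleSketch {
  // Assumed cap on interactions kept per user; the real limit and its name differ.
  val maxInteractionsPerUser = 500

  // Batch downsampling: when a user's interactions exceed the cap,
  // keep a uniformly random subset of the right size.
  def batchDownsample[A](interactions: Seq[A], rng: Random): Seq[A] =
    if (interactions.size <= maxInteractionsPerUser) interactions
    else rng.shuffle(interactions).take(maxInteractionsPerUser)

  // Incremental filtering: accept new interactions only while the running
  // count is under the cap. Later interactions are always the ones cut,
  // so the kept set is no longer a uniform sample of the user's history.
  def incrementalAccept(currentCount: Int): Boolean =
    currentCount < maxInteractionsPerUser
}
```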

On Apr 22, 2015, at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.


Actually, the method I was pushing is exact.  If the sampling is made
deterministic using clever seeds, then deletion is even possible since you
can determine whether an observation was thrown away rather than used to
increment counts.
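
As an illustration only (not code from this thread), the seed-based idea might 
look like the sketch below: whether an interaction is kept is a pure function 
of its identity and a fixed seed, so a later deletion can recompute the same 
decision and know whether that observation ever incremented the counts. The 
seed value and sampling rate here are assumptions.

```scala
import scala.util.hashing.MurmurHash3

object DeterministicSampleSketch {
  val seed = 0x5EED            // assumed fixed seed
  val keepProbability = 0.1    // assumed sampling rate

  // Deterministic keep/drop decision for a (user, item) interaction.
  def isKept(userId: String, itemId: String): Boolean = {
    val h = MurmurHash3.stringHash(userId + "\u0000" + itemId, seed)
    // Map the 32-bit hash to [0, 1) and compare against the sampling rate.
    val u = (h.toLong & 0xFFFFFFFFL).toDouble / (1L << 32).toDouble
    u < keepProbability
  }

  // On deletion, decrement counts only if the observation had been kept.
  def onDelete(userId: String, itemId: String, decrement: () => Unit): Unit =
    if (isKept(userId, itemId)) decrement()
}
```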

The only creeping crud aspect of this is the accumulation of zero rows as
things fall out of the accumulation window.  I would be tempted to not
allow deletion and just restart as Pat is suggesting.
