Definitely. I do something slightly more crude: I omit diffs for item-item pairs that co-occur fewer than n times.
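
Roughly, the pruning looks like the sketch below. It is in Python for brevity and is not the actual code: the names `build_diffs` and `min_cooccurrence` are made up for illustration, and I am assuming a "diff" is the average rating difference for an item pair together with the number of users who rated both items.

```python
from collections import defaultdict
from itertools import combinations


def build_diffs(prefs, min_cooccurrence=2):
    """Build average rating diffs per item pair, dropping rare pairs.

    prefs: {user: {item: rating}} (hypothetical input shape).
    Returns {(item_a, item_b): (average_diff, count)} containing only
    pairs rated together by at least min_cooccurrence users.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ratings in prefs.values():
        # Sort items so each unordered pair always maps to the same key.
        for a, b in combinations(sorted(ratings), 2):
            sums[(a, b)] += ratings[b] - ratings[a]
            counts[(a, b)] += 1
    # Keep only the pairs that co-occur often enough to be worth storing.
    return {
        pair: (sums[pair] / count, count)
        for pair, count in counts.items()
        if count >= min_cooccurrence
    }


prefs = {
    "u1": {"A": 5, "B": 3, "C": 2},
    "u2": {"A": 3, "B": 4},
    "u3": {"B": 2, "C": 5},
}
diffs = build_diffs(prefs, min_cooccurrence=2)
```

With this toy data the pair (A, C) co-occurs only once, so it is dropped. Note that the sketch still counts every pair before filtering, so the saving is in what is retained afterwards, not in the peak memory of the counting pass.
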
Yes, an alternative is to turn up n, or even simply cap the number of diffs if desired, to reduce memory usage. That may also prove to be the better solution. Either way, I feel like there must be a win in getting rid of so many objects.

On Sep 1, 2009 2:00 AM, "Ted Dunning" <[email protected]> wrote:

Sean,

Have you considered simply not storing most of the differences that you have? In particular, can you use something like LLR to find the 5% (or even 1%) of the differences that really matter and just toss the rest? That will compress your data better than anything I could do with bit twiddles.

--
Ted Dunning, CTO
DeepDyve
