Definitely. I already do something slightly cruder: I omit diffs for
item-item pairs that co-occur fewer than n times.

Yes, an alternative is to turn up n, or even simply cap the number of
diffs if desired, to reduce memory usage. That may also prove to be the
better solution.
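
For concreteness, here is a minimal Python sketch of that kind of pruning.
It is an illustration only, not the actual implementation under
discussion: the names build_diffs, min_cooccurrence and max_diffs are
hypothetical, and the ratings are assumed to arrive as a plain
{user: {item: rating}} dict.

```python
from collections import defaultdict
from itertools import combinations


def build_diffs(ratings, min_cooccurrence=2, max_diffs=None):
    """ratings: {user: {item: rating}} (assumed input shape).

    Returns {(item_a, item_b): (average_diff, cooccurrence_count)}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        # Visit each unordered item pair once, in sorted (canonical) order.
        for a, b in combinations(sorted(prefs), 2):
            sums[(a, b)] += prefs[a] - prefs[b]
            counts[(a, b)] += 1

    # Omit diffs for pairs that co-occur fewer than min_cooccurrence times.
    kept = {
        pair: (sums[pair] / count, count)
        for pair, count in counts.items()
        if count >= min_cooccurrence
    }

    # Optional hard cap: keep the pairs backed by the most users.
    if max_diffs is not None and len(kept) > max_diffs:
        top = sorted(kept, key=lambda p: kept[p][1], reverse=True)[:max_diffs]
        kept = {pair: kept[pair] for pair in top}
    return kept
```

Raising min_cooccurrence or lowering max_diffs trades accuracy on rarely
co-rated pairs for a smaller table.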

I feel like there must be a win in getting rid of so many objects either
way.

On Sep 1, 2009 2:00 AM, "Ted Dunning" <[email protected]> wrote:

Sean,

Have you considered simply not storing most of the differences that you
have?

In particular, can you use something like LLR (log-likelihood ratio) to find
the 5% (or even 1%) of the differences that really matter and just toss the
rest?

That will compress your data better than anything I could do with bit
twiddles.
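
As a rough illustration of the idea, the Python sketch below scores each
item pair with the G^2 log-likelihood ratio of a 2x2 contingency table and
keeps only the top-scoring fraction. The table layout is an assumption made
for this example (k11 = users with both items, k12 = first item only, k21 =
second item only, k22 = neither), and the function names are hypothetical.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k * log k)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def top_fraction_by_llr(tables, fraction=0.05):
    """tables: {pair: (k11, k12, k21, k22)}. Returns the highest-LLR pairs."""
    ranked = sorted(tables, key=lambda p: llr(*tables[p]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

A pair whose items co-occur exactly as often as chance predicts scores 0;
the stronger the association (positive or negative), the higher the score.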

--
Ted Dunning, CTO
DeepDyve
