"daveg" <[EMAIL PROTECTED]> writes:
> I have a table of (id serial primary key, url text unique) with a few
> hundred million urls that average about 120 bytes each. The url index is
> only used when a possibly new url is to be inserted, but between the data
> and the index this table occupies a large part of the page cache. Any form
> of compression here would be really helpful.
So the problem here is that it's really hard to compress 120 bytes.
Question 1 is how effective our existing compression mechanism would be on
them. Would it be able to compress them at all? Testing that would probably
require a fair amount of work, though: I think you would need a patch which
added per-table settable toast targets, and a separate change which let you
lower the minimum size at which we bother to attempt compression.
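For a rough feel of the problem, here's a sketch in Python using zlib (a
DEFLATE implementation standing in for our LZ code, so the numbers are only
illustrative; the URL is made up):

```python
import zlib

# A made-up URL of roughly the size described (~110 bytes).
url = (b"https://www.example.com/catalog/electronics/cameras/"
       b"item-48151623?session=f9a3c07d5b214e88&ref=search&page=3")

comp = zlib.compress(url, 9)
print(len(url), len(comp))  # the saving, if any, is marginal
```

The zlib container alone costs about 11 bytes of header and checksum, and a
single short URL has little internal repetition for back references to
exploit, so the "compressed" form is barely smaller than the input, and can
even be larger.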
Question 2 is how we can do better. I've thought quite a bit about what it
would take to have some sort of precomputed dictionary used to prime the back
references. The problem I foresee is that you'd need a reference to this
dictionary, and we already have 8 bytes of overhead. If we're aiming to
compress short strings then starting off with 12+ bytes of overhead is an
awfully big handicap.
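zlib's preset-dictionary support does exactly this kind of priming, so it
serves as a sketch of the idea. The dictionary contents below are invented,
and pglz has no equivalent facility today:

```python
import zlib

# Invented preset dictionary of substrings common to many URLs.
zdict = b"https://http://www..com/.org/?id=&session=&ref=www.example.com/"
url = b"https://www.example.com/cameras?id=48151623&session=f9a3c07d&ref=search"

# Ordinary compression, no dictionary.
plain = zlib.compress(url, 9)

# Compression primed with the preset dictionary.
co = zlib.compressobj(level=9, zdict=zdict)
primed = co.compress(url) + co.flush()

# The decompressor must be primed with the same dictionary.
do = zlib.decompressobj(zdict=zdict)
assert do.decompress(primed) == url

print(len(url), len(plain), len(primed))
```

The primed stream is noticeably smaller because the leading "https://",
"www.example.com/", "&session=", and so on become back references into the
dictionary rather than literals. Note the overhead point above, though: the
zlib header grows by a 4-byte dictionary checksum, analogous to the
dictionary reference we would need.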
I'm also wondering whether our LZ compression is simply the wrong tool for
such short strings. It assumes you'll have fairly long common substrings; if
you don't, a back reference costing 2-4 bytes isn't going to save you much.
And if you're drawing from a small alphabet of, say, 36 distinct characters
but combining them fairly randomly, as URLs do, you aren't going to see much
compression. Something like Huffman or arithmetic coding, which can assign
codes smaller than a byte, might be more effective. They require a static
dictionary, but then that's precisely what I'm thinking of.
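A toy Huffman coder shows how a small alphabet yields sub-byte codes (all
names here are my own; a real scheme would ship one shared, precomputed code
table rather than deriving a table per string):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(freq):
    """Build a Huffman code table (symbol -> bit string) from frequencies."""
    tie = count()  # tie-breaker so the heap never compares unlike payloads
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol input
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (a, b)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse both ways
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the symbol's code
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

url = "https://www.example.com/catalog/item-48151623?ref=search"
codes = huffman_codes(Counter(url))
bits = sum(len(codes[ch]) for ch in url)
print(bits, 8 * len(url))  # well under 8 bits per character
```

With only a few dozen distinct characters, the average code is around 5-6
bits, so the string shrinks even with zero repetition. The catch is exactly
the one above: sender and receiver must share the code table, i.e. a static
dictionary.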