"daveg" <[EMAIL PROTECTED]> writes: > I have a table of (id serial primary key, url text unique) with a few > hundred million urls that average about 120 bytes each. The url index is > only used when a possibly new url is to be inserted, but between the data > and the index this table occupies a large part of the page cache. Any form > of compression here would be really helpful.
So the problem here is that it's really hard to compress 120 bytes. Question 1 is how effective our existing compression mechanism would be. Would it be able to compress them at all? I'll have to think about what it would take to test this for you though. To test this for you would probably require a fair amount of work though. I think you would need a patch which added per-table settable toast targets and a separate change which let you lower the minimum size to try to compress. Question 2 is how we can do better. I've thought quite a bit about what it would take to have some sort of precomputed dictionary which would be used as a primer for the back references. The problem I foresee is that you'll need a reference to this dictionary and we already have 8 bytes of overhead. If we're aiming to compress short strings then starting off with 12+ bytes of overhead is an awfully big handicap. I'm also wondering if our lz compression just isn't the right tool for such short strings. It assumes you'll have fairly long common substrings. If you don't have any long common substrings or else a backreference of 2-4 bytes isn't going to save you much. If you're just using a small alphabet of 36 different characters but fairly randomly like in URLs you aren't going to see much compression. Something like huffman or arithmetic encoding which can assign meaning to codes smaller than a byte might be more effective. They require a static dictionary but then that's precisely what I'm thinking of. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster