On Mon, Jul 14, 2014 at 2:53 PM, Peter Geoghegan <p...@heroku.com> wrote:
> My concern is that it won't be worth it to do the extra work,
> particularly given that I already have 8 bytes to work with. Supposing
> I only had 4 bytes to work with (as researchers writing [2] may have
> only had in 1994), that would leave me with a relatively small number
> of distinct normalized keys in many representative cases. For example,
> I'd have a mere 40,665 distinct normalized keys in the case of my
> "cities" database, rather than 243,782 (out of a set of 317,102 rows)
> for 8 bytes of storage. But if I double that to 16 bytes (which might
> be taken as a proxy for what a good compression scheme could get me),
> I only get a modest improvement - 273,795 distinct keys. To be fair,
> that's in no small part because there are only 275,330 distinct city
> names overall (and so most dups get away with a cheap memcmp() on
> their tie-breaker), but this is a reasonably organic, representative
> dataset.
Were those numbers measured with the Mac's strxfrm? That was the one with suboptimal entropy in the first 8 bytes.
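
For reference, here is a rough sketch (my own, not code from this thread) of how one might measure that: given a newline-separated list of strings on stdin, such as the "cities" data mentioned above, it counts how many distinct 8-byte strxfrm() prefixes the local platform's collation produces, which makes it easy to compare the leading-byte entropy across platforms:

/*
 * Hypothetical sketch: count distinct 8-byte strxfrm() prefixes for a
 * newline-separated list of strings on stdin, under the environment's
 * collation.  A low count relative to the number of distinct input
 * strings suggests poor entropy in the leading transformed bytes.
 */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PREFIX_LEN 8

static int
cmp_prefix(const void *a, const void *b)
{
    return memcmp(a, b, PREFIX_LEN);
}

int
main(void)
{
    char            line[1024];
    char            blob[4096];
    size_t          nkeys = 0;
    size_t          cap = 1024;
    unsigned char  *keys = malloc(cap * PREFIX_LEN);

    setlocale(LC_COLLATE, "");          /* use the environment's collation */

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        line[strcspn(line, "\n")] = '\0';

        /* Transform the whole string; skip it if the result won't fit. */
        if (strxfrm(blob, line, sizeof(blob)) >= sizeof(blob))
            continue;

        if (nkeys == cap)
        {
            cap *= 2;
            keys = realloc(keys, cap * PREFIX_LEN);
        }

        /* Keep only the first PREFIX_LEN bytes, zero-padding short blobs. */
        memset(keys + nkeys * PREFIX_LEN, 0, PREFIX_LEN);
        memcpy(keys + nkeys * PREFIX_LEN, blob, strnlen(blob, PREFIX_LEN));
        nkeys++;
    }

    /* Sort the fixed-size prefixes and count adjacent differences. */
    qsort(keys, nkeys, PREFIX_LEN, cmp_prefix);

    size_t          distinct = (nkeys > 0) ? 1 : 0;

    for (size_t i = 1; i < nkeys; i++)
    {
        if (memcmp(keys + i * PREFIX_LEN,
                   keys + (i - 1) * PREFIX_LEN, PREFIX_LEN) != 0)
            distinct++;
    }

    printf("%zu rows, %zu distinct %d-byte keys\n",
           nkeys, distinct, PREFIX_LEN);
    free(keys);
    return 0;
}

Running the same input file through this on Linux (glibc) and on OS X should show whether the platform's strxfrm is what is limiting the distinct-key counts.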