Tom Lane <[EMAIL PROTECTED]> writes:
> Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
> and is likely faster than any multiple-instruction way to do the same.
>
> The quoted article seems to be by someone who has spent a lot of time
> counting assembly cycles and none at all reading the last thirty years
> worth of CS literature.  Knuth's treatment of hashing has some actual
> math to it...
[Incidentally, I just found that the quoted article was independently discovered by Bruce Momjian, who found it convincing enough to convert the hash_any table over to it two years ago.]

I just reviewed Knuth as well as C.L.R. and several papers from CACM and SIGMOD. It seems we have three options:

mod():
  Pro: empirical research shows it to be the best algorithm for avoiding collisions
  Con: requires the hash table size be a prime, and far from a power of two. This
       is inconvenient to arrange for dynamic tables as used in Postgres.

multiplication method (floor(tablesize * frac(x * golden_ratio))):
  Pro: works with any table size
  Con: I'm not clear whether it can be done without floating-point arithmetic.
       It seems like it ought to be possible, though.

universal hashing:
  Pro: won't create gotcha situations where the distribution of the data
       suddenly and unexpectedly creates performance problems
  Con: more complex

It would be trivial to switch the implementation from mod() to the multiplicative method, which is better suited to Postgres's needs. However, universal hashing has some appeal: I hate the idea that a hash join could perform perfectly well one day and suddenly become pessimal when new data is loaded.

In one sense universal hashing is less predictable. For a DSS system that could be bad: a query that runs fine every day might fail one day in an unpredictable way even though the data is unchanged. But in another sense it is more predictable, in that if you run the query a thousand times the average performance will be very close to the expected value regardless of what the data is, whereas more traditional algorithms have some patterns of data that will consistently perform badly.

It's probably not worth it, but Postgres could maybe even be tricky and pretest the hash parameters against the common values in the statistics table, generating new ones if they fail to produce a nice distribution.
That pretesting doesn't really guarantee anything though, except that those common values would at least be well distributed to start with.

-- 
greg

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings