Jim Meyering <jim <at> meyering.net> writes:

> Eric Blake wrote:
> > + /* Guarantee that once we start moving entries into the new table,
> > +    we will not need any further memory allocations.  We only need
> > +    new allocations if the hasher consolidates multiple buckets from
> > +    the old size into one bucket with the new size (a sign of a bad
> > +    hasher, but we must deal with it).  Therefore, we need merely
> > +    check that there are at least table->n_buckets_used-1 entries on
> > +    the free entry list for the worst case collision (everything gets
> > +    combined into one bucket during the rehash).  */
>
> That strikes me as pessimistic.
> Imagine that we have a full table, with n_buckets_used == 20
> and n_buckets == 23.
>
> When rehashing to say 31, odds are very good that we can
> get by with *no* entries on the free list -- if we're careful.
> So requiring 22 is a recipe for unnecessary failure.
You are correct that in the common case, we probably won't need any
extra free entries.  And I can prove that in the worst case, we need 19
free entries (not 22), when all 20 buckets of the old table hash to the
same bucket of the new (and this would be easy to reproducibly test, by
writing a hasher that is well-behaved for the initial bucket size but
returns a constant for the new size).

So the question is now whether we want to be pessimistic and cause
unnecessary memory pressure, or optimistic but have more work to undo
if our optimism was wrong.  But it is usually the right approach to
cater to the common case, so I'm starting to lean towards the
optimistic approach with complicated undo if we guessed wrong.  My
first patch attempt[1] was to provide cleanup on allocation failure; it
was only my second version that switched over to the pessimistic
version of guaranteeing no possibility of allocation failure, to avoid
the need for cleanup in the first place.  The algorithm you describe
below rather closely resembles my first version (unfortunately, my
first attempt is not posted in any public repo at the moment).

[1] http://article.gmane.org/gmane.comp.lib.gnulib.bugs/17849

> Right after your initial report, I began rewriting hash_rehash
> to work as follows:
>
>   allocate new table
>
>   [ optimization to minimize possibility of unnecessary failure:
>     in the old table,
>     liberate any overflow (chained) entries by moving "data"
>     into empty buckets ]

I hadn't thought of that.  It frees (n_buckets - n_buckets_used)
entries up front, leaving less pressure for the rest of the algorithm.
But it is worthless for my pessimistic algorithm (why? because although
you have more free entries, you have also increased n_buckets_used by
the same amount, so you have ended up only wasting time in moving
data).  Even for the optimistic approach, it means that a failed
allocation requires one more pass through the original table to undo
this effect.
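(For concreteness, the contrived worst-case hasher I mentioned above
might be sketched like this.  This is purely illustrative -- the names
`evil_hasher` and `OLD_SIZE` are made up, not gnulib's -- but the
signature matches the hasher contract of the hash module: it behaves
reasonably at the initial bucket count and collapses everything into
bucket 0 at any other size, which is exactly the rehash worst case.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical adversarial hasher: distributes keys evenly while the
   table still has OLD_SIZE buckets, but maps every key to bucket 0
   once the table is resized, forcing the worst-case rehash in which
   all old buckets land in a single new bucket.  */
#define OLD_SIZE 23

static size_t
evil_hasher (const void *entry, size_t n_buckets)
{
  size_t key = (size_t) entry;   /* treat the pointer value as the key */
  if (n_buckets == OLD_SIZE)
    return key % n_buckets;      /* well-behaved at the initial size */
  return 0;                      /* degenerate after any resize */
}
```

With 20 entries spread over such a table, rehashing to any new size
sends all 20 into one chain: one fits in the bucket head, so 19
overflow entries are needed -- the worst-case bound discussed above.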
And in the optimal case, it makes no difference (with a perfect hasher,
there are no overflow entries in the first place).  At this point, I'm
inclined to say that it probably won't help, unless we can come up with
a contrived hasher where we really do see higher memory pressure
mid-move than with either the pre- or post-rehash table, such that
reducing pressure up front would be worthwhile.

>   iterate through old table members, hashing them into the new table,
>
>   [Possible optimization: process all overflow-entries before
>    any single-entry bucket.  But that means two passes.
>    Alternatively: one pass through buckets, but for chains of
>    length > 1, process all allocated entries before the first,
>    non-allocated one, which (hence) might require an allocation
>    in the new table. ]

My first attempt used a full two-pass algorithm for the cleanup case.
With one pass, the allocations and frees are intermingled; with two
passes, all frees occur before any allocations.  However, I have been
unable (so far) to contrive any hasher that would show a difference in
worst-case pressure with the intermingled operation.  So, for better
cache locality, I'd rather avoid two passes in the common case.  But we
definitely want to code things to move the overflow first and the
bucket head last within each bucket, as that is a trivial rewrite with
obvious benefits.

> If, in spite of all that, allocation is required, restore to
> the original table any entries that we've moved into the new one
> and free the new one and return false.  However, restoring them
> may be tricky, so we'd have to be careful there, too.

I think I've convinced myself that recovery is always possible without
further allocation, although I'm still unsure whether there is ever a
case where the cleanup would require a two-pass algorithm because the
one-pass approach has higher pressure.
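(A rough sketch of the per-bucket ordering argued for above -- chained
overflow entries first, bucket head last.  The struct layout and names
below are a simplified stand-in, not gnulib's actual hash internals:
the point is only that each overflow move *frees* a cell in the old
table, while reinserting the head into an occupied destination bucket
may *allocate* one, so this ordering keeps frees ahead of allocations
within a bucket.)

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: a bucket is an in-place head entry plus a chain
   of separately allocated overflow entries.  head.data == NULL means
   the bucket is empty.  */
struct entry  { void *data; struct entry *next; };
struct bucket { struct entry head; };

/* Visit the entries of one source-bucket chain in the order discussed
   above, calling MOVE on each datum: overflow entries first (each move
   returns its cell to the free list), the bucket head last (its
   reinsertion is the only step that may need a fresh cell).  */
static void
migrate_bucket (struct bucket *b,
                void (*move) (void *data, void *ctx), void *ctx)
{
  if (!b->head.data)
    return;                      /* empty bucket: nothing to do */
  for (struct entry *e = b->head.next; e; e = e->next)
    move (e->data, ctx);         /* overflow entry: frees its cell */
  move (b->head.data, ctx);      /* head last: may allocate a cell */
}
```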
> This approach has the added advantage of decreasing the memory
> footprint slightly, since we're more likely to reuse existing
> allocated entries than to allocate new ones that would eventually
> end up on the free list.

It decreases the memory footprint of the table, but does increase the
code size.  But that's the price of avoiding worst-case pessimism (and
besides, a code size increase of even a few kilobytes is probably much
smaller than the cost of pessimistically over-allocating, given that
real-world resizing will probably involve millions of hash table
entries before memory pressure becomes an issue).  Do you want me to
resurrect the bits of my first patch, and submit something that
recovers from failure rather than preventing it?

-- 
Eric Blake
