Jim Meyering <jim <at> meyering.net> writes:

> Eric Blake wrote:
> > + /* Guarantee that once we start moving entries into the new table,
> > +    we will not need any further memory allocations.  We only need
> > +    new allocations if the hasher consolidates multiple buckets from
> > +    the old size into one bucket with the new size (a sign of a bad
> > +    hasher, but we must deal with it).  Therefore, we need merely
> > +    check that there are at least table->n_buckets_used-1 entries on
> > +    the free entry list for the worst case collision (everything gets
> > +    combined into one bucket during the rehash).  */
>
> That strikes me as pessimistic.
> Imagine that we have a full table, with n_buckets_used == 20
> and n_buckets == 23.
>
> When rehashing to say 31, odds are very good that we can
> get by with *no* entries on the free list -- if we're careful.
> So requiring 22 is a recipe for unnecessary failure.
You are correct that in the common case, we probably won't need any
extra free entries.  And I can prove that in the worst case, we need 19
free entries (not 22), when all 20 buckets of the old table hash to the
same bucket of the new (and this would be easy to reproducibly test, by
writing a hasher that is well-behaved for the initial bucket size but
returns a constant for the new size).

So the question is now whether we want to be pessimistic and cause
unnecessary memory pressure, or optimistic but have more work to undo
if our optimism was wrong.  But it is usually the right approach to
cater to the common case, so I'm starting to lean towards the
optimistic approach with complicated undo if we guessed wrong.  My
first patch attempt[1] was to provide cleanup on allocation failure; it
was only my second version that switched over to the pessimistic
version of guaranteeing no possibility of allocation failure, to avoid
the need for cleanup in the first place.  The algorithm you describe
below rather closely resembles my first version (unfortunately, my
first attempt is not posted in any public repo at the moment).

[1] http://article.gmane.org/gmane.comp.lib.gnulib.bugs/17849

> Right after your initial report, I began rewriting hash_rehash
> to work as follows:
>
>   allocate new table
>
>   [ optimization to minimize possibility of unnecessary failure:
>     in the old table,
>     liberate any overflow (chained) entries by moving "data"
>     into empty buckets ]

I hadn't thought of that.  It frees (n_buckets - n_buckets_used)
entries up front, leaving less pressure for the rest of the algorithm.
But it is worthless for my pessimistic algorithm (why? because although
you have more free entries, you have also increased n_buckets_used by
the same amount, so you have ended up only wasting time in moving
data).  Even for the optimistic approach, it means that a failed
allocation requires one more pass through the original table to undo
this effect.
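(For concreteness, the contrived worst-case hasher I mentioned above
might be sketched like this.  This is purely illustrative -- the names
`evil_hasher` and `OLD_SIZE` are made up, not gnulib's -- but the
signature matches the hasher contract of the hash module: it behaves
reasonably at the initial bucket count and collapses everything into
bucket 0 at any other size, which is exactly the rehash worst case.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical adversarial hasher: distributes keys evenly while the
   table still has OLD_SIZE buckets, but maps every key to bucket 0
   once the table is resized, forcing the worst-case rehash in which
   all old buckets land in a single new bucket.  */
#define OLD_SIZE 23

static size_t
evil_hasher (const void *entry, size_t n_buckets)
{
  size_t key = (size_t) entry;   /* treat the pointer value as the key */
  if (n_buckets == OLD_SIZE)
    return key % n_buckets;      /* well-behaved at the initial size */
  return 0;                      /* degenerate after any resize */
}
```

With 20 entries spread over such a table, rehashing to any new size
sends all 20 into one chain: one fits in the bucket head, so 19
overflow entries are needed -- the worst-case bound discussed above.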
And in the optimal case, it makes no difference (with a perfect hasher,
there are no overflow entries in the first place).  At this point, I'm
inclined to say that it probably won't help, unless we can come up with
a contrived hasher where we really do see higher memory pressure
mid-move than with either the pre- or post-rehash table, such that
reducing pressure up front would be worthwhile.

>   iterate through old table members, hashing them into the new table,
>
>   [Possible optimization: process all overflow-entries before
>    any single-entry bucket.  But that means two passes.
>    Alternatively: one pass through buckets, but for chains of
>    length > 1, process all allocated entries before the first,
>    non-allocated one, which (hence) might require an allocation
>    in the new table. ]

My first attempt used a full two-pass algorithm for the cleanup case.
With one pass, the allocations and frees are intermingled; with two
passes, all frees occur before any allocations.  However, I have been
unable (so far) to contrive any hasher that would show a difference in
worst-case pressure with the intermingled operation.  So, for better
cache locality, I'd rather avoid two passes in the common case.  But we
definitely want to code things to move the overflow first and the
bucket head last within each bucket, as that is a trivial rewrite with
obvious benefits.

> If, in spite of all that, allocation is required, restore to
> the original table any entries that we've moved into the new one
> and free the new one and return false.  However, restoring them
> may be tricky, so we'd have to be careful there, too.

I think I've convinced myself that recovery is always possible without
further allocation, although I'm still unsure whether there is ever a
case where the cleanup would require a two-pass algorithm because the
one-pass approach has higher pressure.
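(A rough sketch of the per-bucket ordering argued for above -- chained
overflow entries first, bucket head last.  The struct layout and names
below are a simplified stand-in, not gnulib's actual hash internals:
the point is only that each overflow move *frees* a cell in the old
table, while reinserting the head into an occupied destination bucket
may *allocate* one, so this ordering keeps frees ahead of allocations
within a bucket.)

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: a bucket is an in-place head entry plus a chain
   of separately allocated overflow entries.  head.data == NULL means
   the bucket is empty.  */
struct entry  { void *data; struct entry *next; };
struct bucket { struct entry head; };

/* Visit the entries of one source-bucket chain in the order discussed
   above, calling MOVE on each datum: overflow entries first (each move
   returns its cell to the free list), the bucket head last (its
   reinsertion is the only step that may need a fresh cell).  */
static void
migrate_bucket (struct bucket *b,
                void (*move) (void *data, void *ctx), void *ctx)
{
  if (!b->head.data)
    return;                      /* empty bucket: nothing to do */
  for (struct entry *e = b->head.next; e; e = e->next)
    move (e->data, ctx);         /* overflow entry: frees its cell */
  move (b->head.data, ctx);      /* head last: may allocate a cell */
}
```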
> This approach has the added advantage of decreasing the memory
> footprint slightly, since we're more likely to reuse existing
> allocated entries than to allocate new ones that would eventually
> end up on the free list.

It decreases the memory footprint of the table, but does increase the
code size.  But that's the price of avoiding worst-case pessimism (and
besides, a code size increase of even a few kilobytes is probably much
smaller than the cost of pessimistically over-allocating, given that
real-world resizing will probably involve millions of hash table
entries before memory pressure becomes an issue).  Do you want me to
resurrect the bits of my first patch, and submit something that
recovers from failure rather than preventing it?

-- 
Eric Blake
