Is this being kept for 8.3?


Tom Lane wrote:
> A couple months ago we were discussing partitioning the buffer mapping
> table so as to reduce contention for the BufMappingLock.  The discussion
> stalled after I complained about the uncertainty of shared memory usage
> arising from having a separate hashtable for each partition (since
> different numbers of buffers might fall into each partition at different
> times).  I've been thinking about it more after seeing an example from
> Tatsuo that seems to be exactly the same problem as Gavin saw.
> We could fix the uncertain-usage objection if we stick with a single
> hashtable and ensure that concurrent access to different partitions of
> it is safe.  I believe this is easily doable, if we make hashtables
> used in this mode allocate the maximum allowed number of buckets
> immediately during hashtable initialization.  (Since shared hashtables
> already have a fixed maximum directory size, they already have an upper
> bound on the number of buckets, so this loses no flexibility.)  Then
> there will be no on-the-fly bucket splits, and that means that accesses
> to different hash buckets operate completely independently.  Therefore,
> we can regard the hashtable as logically partitioned on the basis of any
> classification of entries that will uniquely assign hash buckets to
> classes --- taking the low order bits of entry hash codes will do fine.
> The only changeable state that is shared across all buckets is the entry
> freelist and the "nentries" counter.  We could protect these with a
> spinlock (one spinlock is sufficient since changes to nentries go along
> with addition or removal of freelist entries).
> Usage of a partitioned hash table would then be like
>       compute hashtable lookup key;
>       entryhashcode = calc_hash(lookup key);
>       partitionnumber = entryhashcode % NumPartitions;
>       LWLockAcquire(PartitionLock[partitionnumber], ...);
>       manipulate hashtable;
>       LWLockRelease(PartitionLock[partitionnumber]);
> We could do this without changing the API of hash_search, but then we'd
> be computing the entry hashcode twice, so I'm inclined to provide an
> additional entry point that takes a precalculated hashcode.
> Potential downsides of applying this idea to the buffer mapping table:
> 1. Reassigning a buffer to a new page will (usually) require two cycles
> of LWLockAcquire/Release for the two different partitions involved.
> Since this path also requires at least a read() kernel call (maybe a
> write() too), I don't think there'll be any meaningful slowdown.
> 2. The current logic for reassigning a buffer attempts to make a
> hashtable entry for its new page number (to detect collisions) before
> releasing the old hashtable entry.  This would only be safe if we held
> both partition LWLocks concurrently; which seems bad for concurrency,
> plus avoiding deadlock would complicate the code significantly.  I'm
> inclined to release the old entry and then try to insert the new one,
> holding only one lock at a time.  If the insertion fails (because
> someone was concurrently loading the same page), we'd have to throw the
> buffer onto the freelist rather than allowing it to retain its previous
> valid data.  Which is slightly annoying, but in practice the case
> probably doesn't happen often enough to be worth worrying about.
> 3. Taking the freelist spinlock is new computation that wasn't there
> before.  But, again, it's only needed in code paths that will also be
> doing a kernel call.
> If we do this we should probably also handle the lmgr lock tables the
> same way (partially reverting my partition-the-LockMgrLock patch of a
> couple months ago).  However, downside #3 might be a stronger objection
> for lmgr, since it can create or destroy lock objects without necessarily
> doing any I/O.
> Thoughts, objections?
>                       regards, tom lane
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

  Bruce Momjian   [EMAIL PROTECTED]

  + If your life is a hard drive, Christ can be your backup. +

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Reply via email to