Jeff Davis wrote:
On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote:
Let's use a normal hash table instead, and use a lock to protect it. If we only update it every 10 pages or so, the overhead should be negligible. To further reduce contention, we could modify ReadBuffer to let the caller know if the read resulted in a physical read or not, and only update the entry when a page is physically read in. That way all the synchronized scanners wouldn't be updating the same value, just the one performing the I/O. And while we're at it, let's use the full relfilenode instead of just the table oid in the hash.


What should be the maximum size of this hash table?

Good question. And also, how do you remove entries from it?

I guess the size should somehow be related to the number of backends. Each backend will realistically be running just one, or at most two, seq scans at a time. It also depends on the number of large tables in the databases, but we don't have that information easily available. How about just using NBackends? That should be plenty, and wasting a few hundred bytes of memory won't hurt anyone.

I think you're going to need an LRU list and counter of used entries in addition to the hash table, and when all entries are in use, remove the least recently used one.
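The LRU bookkeeping could look roughly like the standalone sketch below: a fixed-size table of per-relation scan positions, with the least recently used entry evicted once all slots are taken. All names here are invented for illustration; the real thing would live in shared memory behind a dynahash table, and the key would be the full RelFileNode rather than a bare OID.

```c
#include <assert.h>

/* Illustrative sketch only, not PostgreSQL's API: a fixed-size table of
 * per-relation scan positions with least-recently-used eviction. */

#define MAX_ENTRIES 4            /* stand-in for NBackends */

typedef struct ScanPosEntry
{
    unsigned int relid;          /* key (real code would use RelFileNode) */
    unsigned int blocknum;       /* last reported block of the scan */
    unsigned long lru_stamp;     /* higher = more recently used */
    int          in_use;
} ScanPosEntry;

static ScanPosEntry table[MAX_ENTRIES];
static unsigned long clock_tick = 0;

/* Insert or update an entry; evict the LRU entry when the table is full. */
static void
report_scan_position(unsigned int relid, unsigned int blocknum)
{
    int i, victim = 0;

    for (i = 0; i < MAX_ENTRIES; i++)
    {
        if (table[i].in_use && table[i].relid == relid)
        {
            table[i].blocknum = blocknum;
            table[i].lru_stamp = ++clock_tick;
            return;
        }
    }

    /* no existing entry: take a free slot, or the least recently used one */
    for (i = 0; i < MAX_ENTRIES; i++)
    {
        if (!table[i].in_use)
        {
            victim = i;
            break;
        }
        if (table[i].lru_stamp < table[victim].lru_stamp)
            victim = i;
    }
    table[victim].relid = relid;
    table[victim].blocknum = blocknum;
    table[victim].lru_stamp = ++clock_tick;
    table[victim].in_use = 1;
}

/* Returns 1 and sets *blocknum if the relation has an entry. */
static int
lookup_scan_position(unsigned int relid, unsigned int *blocknum)
{
    int i;

    for (i = 0; i < MAX_ENTRIES; i++)
    {
        if (table[i].in_use && table[i].relid == relid)
        {
            *blocknum = table[i].blocknum;
            return 1;
        }
    }
    return 0;
}
```

A plain linear scan is fine at NBackends-sized tables; a separate LRU list only pays off if the table gets much larger.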

The thing to keep an eye on is that it doesn't add too much overhead or lock contention in the typical case when there's no concurrent scans.

For the locking, use an LWLock.

Is there already-existing hash table code that I should use to be consistent with the rest of the code?

Yes, see utils/hash/dynahash.c, and ShmemInitHash (in storage/ipc/shmem.c) since it's in shared memory. There's plenty of examples that use hash tables, see for example storage/freespace/freespace.c.

I'm still trying to understand the effect of using the full relfilenode.
Do you mean using the entire relation _segment_ as the key? That doesn't
make sense to me. Or do you just mean using the relfilenode (without the
segment) as the key?

No, not the segment. RelFileNode consists of tablespace oid, database oid and relation oid. You can find it in scan->rs_rd->rd_node. The segmentation works at a lower level.
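To make the key concrete: RelFileNode (see storage/relfilenode.h) bundles the three OIDs below, so the same relation OID in two databases or tablespaces still gets a distinct entry. The struct fields match the real definition; the trivial hash function is invented for the example (PostgreSQL would hash the whole struct with tag_hash via dynahash).

```c
#include <assert.h>

typedef unsigned int Oid;

/* Mirrors PostgreSQL's RelFileNode (storage/relfilenode.h). */
typedef struct RelFileNode
{
    Oid spcNode;    /* tablespace OID */
    Oid dbNode;     /* database OID */
    Oid relNode;    /* relation OID */
} RelFileNode;

/* Toy hash for illustration only: mixes all three fields so relations
 * with the same relNode in different databases or tablespaces land in
 * different buckets. */
static unsigned int
relfilenode_hash(const RelFileNode *rnode, unsigned int nbuckets)
{
    unsigned int h = rnode->spcNode;

    h = h * 31 + rnode->dbNode;
    h = h * 31 + rnode->relNode;
    return h % nbuckets;
}
```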

Linux with CFQ I/O scheduler performs very poorly and inconsistently
with concurrent sequential scans regardless of whether the scans are
synchronized or not. I suspect the reason for this is that CFQ is
designed to care more about the process issuing the request than any
other factor.

Every other I/O system performed either ideally (no interference between
scans) or had some interference but still much better than current
behavior.

Hmm. Should we care then? CFQ is the default on Linux, and an average sysadmin is unlikely to change it.

What we could do quite easily is:
- when ReadBuffer is called, let the caller know if the read did physical I/O.
- when the previous ReadBuffer didn't result in physical I/O, assume that we're not the pack leader. If the next buffer isn't already in cache, wait a few milliseconds before initiating the read, giving the pack leader a chance to do it instead.
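A standalone simulation of those two steps might look like this. Everything here is invented for illustration (the real ReadBuffer has a different signature, and the cache is a one-bit toy): a read reports whether it had to do physical I/O, and a backend whose previous read was satisfied from cache backs off before touching an uncached block, so the pack leader issues the I/O instead.

```c
#include <assert.h>

#define NCACHE 1024

static int cache[NCACHE];        /* 1 = block is in the buffer cache */

/* Toy ReadBuffer: sets *did_physical_io if the block was not already
 * cached and had to be read in; returns the block as its "buffer". */
static int
read_buffer(int blocknum, int *did_physical_io)
{
    *did_physical_io = !cache[blocknum];
    cache[blocknum] = 1;
    return blocknum;
}

/* Decide whether this backend should sleep briefly before reading
 * 'next'.  If our previous read hit the cache, some other backend is
 * doing the I/O for us; give it a head start on uncached blocks. */
static int
should_wait_for_leader(int prev_was_physical_io, int next)
{
    return !prev_was_physical_io && !cache[next];
}
```

The interesting property is that only the followers ever wait: the backend actually performing the reads always saw physical I/O on its previous call, so it never delays itself.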

Needs testing, of course..

4. It fails regression tests. You get an assertion failure on the portal test. I believe that changing the direction of a scan isn't handled properly; it's probably pretty easy to fix.


I will examine the code more carefully. As a first guess, is it possible
that test is failing because of the non-deterministic order in which
tuples are returned?

No, it's an assertion failure, not just different output than expected. But it's probably quite simple to fix..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
