Jeff Davis wrote:
On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote:
Let's use a normal hash table instead, and use a lock to protect it. If
we only update it every 10 pages or so, the overhead should be
negligible. To further reduce contention, we could modify ReadBuffer to
let the caller know if the read resulted in a physical read or not, and
only update the entry when a page is physically read in. That way all
the synchronized scanners wouldn't be updating the same value, just the
one performing the I/O. And while we're at it, let's use the full
relfilenode instead of just the table oid in the hash.
What should be the maximum size of this hash table?
Good question. And also, how do you remove entries from it?
I guess the size should somehow be related to the number of backends. Each
backend will realistically be doing just one, or at most two, seq scans at a
time.
It also depends on the number of large tables in the databases, but we
don't have that information easily available. How about using just
NBackends? That should be plenty, but wasting a few hundred bytes of
memory won't hurt anyone.
I think you're going to need an LRU list and a counter of used entries in
addition to the hash table, and when all entries are in use, remove the
least recently used one.
The thing to keep an eye on is that it doesn't add too much overhead or
lock contention in the typical case when there are no concurrent scans.
For the locking, use a LWLock.
Is there already-existing hash table code that I should use to be
consistent with the rest of the code?
Yes, see utils/hash/dynahash.c, and ShmemInitHash (in
storage/ipc/shmem.c) since it's in shared memory. There are plenty of
examples that use hash tables; see for example
storage/freespace/freespace.c.
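To make this concrete, here's a rough sketch of what such a shared hash
table could look like, following the ShmemInitHash pattern. All the names
(SyncScanEntry, SyncScanHash, SyncScanLock, SyncScanShmemInit,
ss_report_location) are made up for illustration, and the LRU eviction
discussed above is only indicated by a comment:

#include "postgres.h"

#include "miscadmin.h"			/* MaxBackends */
#include "storage/block.h"
#include "storage/lwlock.h"
#include "storage/relfilenode.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* One entry per relation that is currently being seq scanned. */
typedef struct SyncScanEntry
{
	RelFileNode relfilenode;	/* hash key: tablespace, database, relation */
	BlockNumber location;		/* last block number reported for this rel */
} SyncScanEntry;

static HTAB *SyncScanHash = NULL;
static LWLockId SyncScanLock;	/* stand-in for a dedicated LWLockId entry */

/* Create or attach to the hash table at shared memory initialization. */
void
SyncScanShmemInit(void)
{
	HASHCTL		info;

	info.keysize = sizeof(RelFileNode);
	info.entrysize = sizeof(SyncScanEntry);
	info.hash = tag_hash;

	/* sized by the number of backends, as discussed above */
	SyncScanHash = ShmemInitHash("Sync Scan Locations",
								 MaxBackends, MaxBackends,
								 &info,
								 HASH_ELEM | HASH_FUNCTION);

	/* simplified; in core this would be a new entry in the LWLockId enum */
	SyncScanLock = LWLockAssign();
}

/*
 * Remember that a seq scan of rnode just read block 'location'.  Called
 * only every few pages, and only by the backend that actually did the
 * physical read.
 */
void
ss_report_location(RelFileNode rnode, BlockNumber location)
{
	SyncScanEntry *entry;
	bool		found;

	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);

	entry = (SyncScanEntry *) hash_search(SyncScanHash, &rnode,
										  HASH_ENTER, &found);
	if (!found)
	{
		/*
		 * New relation.  A real implementation would also link the entry
		 * into an LRU list here, and evict the least recently used entry
		 * first if the table is already full.
		 */
	}
	entry->location = location;

	LWLockRelease(SyncScanLock);
}

A backend starting a new seq scan would then do a HASH_FIND on the same key,
under LW_SHARED, to decide which block to start reading from.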
I'm still trying to understand the effect of using the full relfilenode.
Do you mean using the entire relation _segment_ as the key? That doesn't
make sense to me. Or do you just mean using the relfilenode (without the
segment) as the key?
No, not the segment. RelFileNode consists of tablespace oid, database
oid and relation oid. You can find it in scan->rs_rd->rd_node. The
segmentation works at a lower level.
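So on the heapam side the call would presumably look something like this,
with ss_report_location being the hypothetical function from the sketch
above and the every-10-pages interval just picked out of the air:

	/* in heapgettup(), after advancing to a new page (sketch only) */
	if ((scan->rs_cblock % 10) == 0)
	{
		/* plus the "did we do physical I/O" condition discussed earlier */
		ss_report_location(scan->rs_rd->rd_node, scan->rs_cblock);
	}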
Linux with CFQ I/O scheduler performs very poorly and inconsistently
with concurrent sequential scans regardless of whether the scans are
synchronized or not. I suspect the reason for this is that CFQ is
designed to care more about the process issuing the request than any
other factor.
Every other I/O system either performed ideally (no interference between
scans) or showed some interference but still did much better than the
current behavior.
Hmm. Should we care then? CFQ is the default on Linux, and the average
sysadmin is unlikely to change it.
What we could do quite easily is:
- when ReadBuffer is called, let the caller know if the read did
physical I/O.
- when the previous ReadBuffer didn't result in physical I/O, assume
that we're not the pack leader. If the next buffer isn't already in
cache, wait a few milliseconds before initiating the read, giving the
pack leader a chance to do it instead.
Needs testing, of course..
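Just to spell out the idea, roughly: ReadBufferDidIO below is a made-up
variant of ReadBuffer with an extra output flag, the file-level static is a
stand-in for per-scan state, and the 5 ms delay is a pure guess.

#include "postgres.h"

#include "miscadmin.h"			/* pg_usleep */
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Would really live in the scan descriptor, not in a file-level static. */
static bool prev_read_did_io = false;

static Buffer
sync_scan_read_block(Relation reln, BlockNumber blocknum)
{
	bool		did_io;
	Buffer		buf;

	if (!prev_read_did_io)
	{
		/*
		 * We didn't do the previous physical read, so assume we're not the
		 * pack leader.  Give the leader a head start so it can pull the
		 * next page into shared buffers for us.  (The "is the page already
		 * cached?" check mentioned above would need a cheap buffer-table
		 * probe and is omitted here.)
		 */
		pg_usleep(5000L);		/* 5 ms, needs tuning */
	}

	buf = ReadBufferDidIO(reln, blocknum, &did_io);
	prev_read_did_io = did_io;

	return buf;
}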
4. It fails regression tests. You get an assertion failure on the portal
test. I believe that changing the direction of a scan isn't handled
properly; it's probably pretty easy to fix.
I will examine the code more carefully. As a first guess, is it possible
that the test is failing because of the non-deterministic order in which
tuples are returned?
No, it's an assertion failure, not just different output than expected.
But it's probably quite simple to fix..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com