Jeff Davis wrote:
On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote:
Let's use a normal hash table instead, and use a lock to protect it. If
we only update it every 10 pages or so, the overhead should be
negligible. To further reduce contention, we could modify ReadBuffer to
let the caller know if the read resulted in a physical read or not, and
only update the entry when a page is physically read in. That way all
the synchronized scanners wouldn't be updating the same value, just the
one performing the I/O. And while we're at it, let's use the full
relfilenode instead of just the table oid in the hash.
What should be the maximum size of this hash table?
Good question. And also, how do you remove entries from it?
I guess the size should somehow be related to the number of backends. Each
backend will realistically be doing just one, or at most two, seq scans at a
time.
It also depends on the number of large tables in the databases, but we
don't have that information easily available. How about using just
NBackends? That should be plenty, but wasting a few hundred bytes of
memory won't hurt anyone.
I think you're going to need an LRU list and a counter of used entries in
addition to the hash table, and when all entries are in use, remove the
least recently used one.
The thing to keep an eye on is that it doesn't add too much overhead or
lock contention in the typical case when there are no concurrent scans.
For the locking, use a LWLock.
Is there already-existing hash table code that I should use to be
consistent with the rest of the code?
Yes, see utils/hash/dynahash.c, and ShmemInitHash (in
storage/ipc/shmem.c) since it's in shared memory. There are plenty of
examples that use hash tables; see for example
storage/freespace/freespace.c.
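To make this concrete, here's a rough sketch of what such a shared hash
table could look like, following the ShmemInitHash pattern. All the names
(SyncScanEntry, SyncScanHash, SyncScanLock, SyncScanShmemInit,
ss_report_location) are made up for illustration, and the LRU eviction
discussed above is only indicated by a comment:

#include "postgres.h"

#include "miscadmin.h"			/* MaxBackends */
#include "storage/block.h"
#include "storage/lwlock.h"
#include "storage/relfilenode.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* One entry per relation that is currently being seq scanned. */
typedef struct SyncScanEntry
{
	RelFileNode relfilenode;	/* hash key: tablespace, database, relation */
	BlockNumber location;		/* last block number reported for this rel */
} SyncScanEntry;

static HTAB *SyncScanHash = NULL;
static LWLockId SyncScanLock;	/* stand-in for a dedicated LWLockId entry */

/* Create or attach to the hash table at shared memory initialization. */
void
SyncScanShmemInit(void)
{
	HASHCTL		info;

	info.keysize = sizeof(RelFileNode);
	info.entrysize = sizeof(SyncScanEntry);
	info.hash = tag_hash;

	/* sized by the number of backends, as discussed above */
	SyncScanHash = ShmemInitHash("Sync Scan Locations",
								 MaxBackends, MaxBackends,
								 &info,
								 HASH_ELEM | HASH_FUNCTION);

	/* simplified; in core this would be a new entry in the LWLockId enum */
	SyncScanLock = LWLockAssign();
}

/*
 * Remember that a seq scan of rnode just read block 'location'.  Called
 * only every few pages, and only by the backend that actually did the
 * physical read.
 */
void
ss_report_location(RelFileNode rnode, BlockNumber location)
{
	SyncScanEntry *entry;
	bool		found;

	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);

	entry = (SyncScanEntry *) hash_search(SyncScanHash, &rnode,
										  HASH_ENTER, &found);
	if (!found)
	{
		/*
		 * New relation.  A real implementation would also link the entry
		 * into an LRU list here, and evict the least recently used entry
		 * first if the table is already full.
		 */
	}
	entry->location = location;

	LWLockRelease(SyncScanLock);
}

A backend starting a new seq scan would then do a HASH_FIND on the same key,
under LW_SHARED, to decide which block to start reading from.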
I'm still trying to understand the effect of using the full relfilenode.
Do you mean using the entire relation _segment_ as the key? That doesn't
make sense to me. Or do you just mean using the relfilenode (without the
segment) as the key?
No, not the segment. RelFileNode consists of tablespace oid, database
oid and relation oid. You can find it in scan->rs_rd->rd_node. The
segmentation works at a lower level.
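So on the heapam side the call would presumably look something like this,
with ss_report_location being the hypothetical function from the sketch
above and the every-10-pages interval just picked out of the air:

	/* in heapgettup(), after advancing to a new page (sketch only) */
	if ((scan->rs_cblock % 10) == 0)
	{
		/* plus the "did we do physical I/O" condition discussed earlier */
		ss_report_location(scan->rs_rd->rd_node, scan->rs_cblock);
	}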
Linux with CFQ I/O scheduler performs very poorly and inconsistently
with concurrent sequential scans regardless of whether the scans are
synchronized or not. I suspect the reason for this is that CFQ is
designed to care more about the process issuing the request than any
other factor.
Every other I/O system either performed ideally (no interference between
scans) or showed some interference but still did much better than the
current behavior.
Hmm. Should we care then? CFQ is the default on Linux, and the average
sysadmin is unlikely to change it.
What we could do quite easily is:
- when ReadBuffer is called, let the caller know if the read did
physical I/O.
- when the previous ReadBuffer didn't result in physical I/O, assume
that we're not the pack leader. If the next buffer isn't already in
cache, wait a few milliseconds before initiating the read, giving the
pack leader a chance to do it instead.
Needs testing, of course..
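Just to spell out the idea, roughly: ReadBufferDidIO below is a made-up
variant of ReadBuffer with an extra output flag, the file-level static is a
stand-in for per-scan state, and the 5 ms delay is a pure guess.

#include "postgres.h"

#include "miscadmin.h"			/* pg_usleep */
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Would really live in the scan descriptor, not in a file-level static. */
static bool prev_read_did_io = false;

static Buffer
sync_scan_read_block(Relation reln, BlockNumber blocknum)
{
	bool		did_io;
	Buffer		buf;

	if (!prev_read_did_io)
	{
		/*
		 * We didn't do the previous physical read, so assume we're not the
		 * pack leader.  Give the leader a head start so it can pull the
		 * next page into shared buffers for us.  (The "is the page already
		 * cached?" check mentioned above would need a cheap buffer-table
		 * probe and is omitted here.)
		 */
		pg_usleep(5000L);		/* 5 ms, needs tuning */
	}

	buf = ReadBufferDidIO(reln, blocknum, &did_io);
	prev_read_did_io = did_io;

	return buf;
}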
4. It fails regression tests. You get an assertion failure on the portal
test. I believe that changing the direction of a scan isn't handled
properly; it's probably pretty easy to fix.
I will examine the code more carefully. As a first guess, is it possible
that the test is failing because of the non-deterministic order in which
tuples are returned?
No, it's an assertion failure, not just different output than expected.
But it's probably quite simple to fix..
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com