I'm starting to review the "synchronized scans" and "scan-resistant
buffer cache" patches. The patches have complex interactions so I'm
taking a holistic approach.
There are four outstanding issues with the sync scans in particular:
1. The simplistic hash approach. While it's nice to not have a lock, I'm
worried about collisions. If you had a collision every now and then, it
wouldn't be that bad, but because the hash value is computed from the
oid, a collision would be persistent. If you create a database and
happen to have two frequently seqscanned tables that collide, the only
way to get rid of the collision is to drop and recreate a table.
Granted, that'd probably be very rare in practice, but when it happens
it would be next to impossible to figure out what's going on.
Let's use a normal hash table instead, and use a lock to protect it. If
we only update it every 10 pages or so, the overhead should be
negligible. To further reduce contention, we could modify ReadBuffer to
let the caller know if the read resulted in a physical read or not, and
only update the entry when a page is physically read in. That way all
the synchronized scanners wouldn't be updating the same value, just the
one performing the I/O. And while we're at it, let's use the full
relfilenode instead of just the table oid in the hash.
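A minimal sketch of what I have in mind, as standalone C rather than actual backend code: the function and type names (ss_report_location, ss_get_location, ScanLocEntry) are illustrative, a pthread mutex stands in for an LWLock, and the report interval of 10 pages is the figure suggested above. Because the full relfilenode is stored in the entry, a bucket collision causes a lookup miss rather than a persistently wrong position:

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define SS_TABLE_SIZE 64              /* number of hash buckets */
#define SYNC_SCAN_REPORT_INTERVAL 10  /* report every N pages */

typedef struct RelFileNodeKey {       /* full relfilenode, not just the OID */
    uint32_t spcNode;                 /* tablespace */
    uint32_t dbNode;                  /* database */
    uint32_t relNode;                 /* relation file */
} RelFileNodeKey;

typedef struct ScanLocEntry {
    RelFileNodeKey key;
    int valid;
    uint32_t blocknum;                /* last reported block */
} ScanLocEntry;

static ScanLocEntry ss_table[SS_TABLE_SIZE];
static pthread_mutex_t ss_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned ss_hash(const RelFileNodeKey *k)
{
    return (k->spcNode * 31u + k->dbNode) * 31u + k->relNode;
}

/* Report the current scan position.  Only every Nth page is reported,
 * and ideally only by the backend that performed the physical read. */
void ss_report_location(const RelFileNodeKey *key, uint32_t blocknum)
{
    if (blocknum % SYNC_SCAN_REPORT_INTERVAL != 0)
        return;
    pthread_mutex_lock(&ss_lock);
    ScanLocEntry *e = &ss_table[ss_hash(key) % SS_TABLE_SIZE];
    /* On a bucket collision we simply overwrite the entry; the full key
     * is stored, so a later lookup for the other relation just misses. */
    e->key = *key;
    e->valid = 1;
    e->blocknum = blocknum;
    pthread_mutex_unlock(&ss_lock);
}

/* Return the reported start block for a new scan, or 0 if none known. */
uint32_t ss_get_location(const RelFileNodeKey *key)
{
    uint32_t result = 0;
    pthread_mutex_lock(&ss_lock);
    ScanLocEntry *e = &ss_table[ss_hash(key) % SS_TABLE_SIZE];
    if (e->valid && memcmp(&e->key, key, sizeof(*key)) == 0)
        result = e->blocknum;
    pthread_mutex_unlock(&ss_lock);
    return result;
}
```

With the report interval check done before taking the lock, a scan touches the lock only once per 10 pages, so contention should stay negligible even with many concurrent scanners.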
2. Under what circumstances does the patch help and when does it hurt? I
think the patch is safe in that it should never be any worse than what
we have now. But when does it help? That needs to be looked at together
with the other patch.
I need to dig through the archives for the performance test results you
posted earlier and try to understand them.
There are six distinct scenarios I've come up with thus far that need to
be looked at:
A. A seq scan on a small table
B. A seq scan on a table that's 110% the size of shared_buffers, but
smaller than RAM
C. A seq scan on a table that's 110% the size of RAM
D. A seq scan on a huge table
E. Two simultaneous seq scans on a large table starting at the same time
F. Two simultaneous seq scans on a large table, 2nd one starting when
the 1st one is halfway through
Also, does it change things if you have a bulk update instead of a
read-only query? How about bitmap heap scans and large index scans? And
vacuums? The above scenarios need to be considered both alone, and in
the presence of another OLTP kind of workload.
I realize that we can't have everything, and as long as we get some
significant benefit in some scenarios, and don't hurt others, the patch
is worthwhile. But let's try to cover as much as we reasonably can.
One random idea I had to cover B & C without having the offset variable:
Start scanning *backwards* from the page that's in the shared hash
table, until you hit a page that's not in buffer cache. Then you
continue scanning forwards from the page you started from.
This needs more thought but I think we can come up with a pretty simple
solution that covers the most common cases.
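To make the backwards-probe idea concrete, here is a toy sketch (not patch code): choose_start_block walks backwards from the position found in the shared hash table while the preceding pages are still cached, and returns where the cached run begins. The cached[] array is a stand-in for a real buffer-cache probe, and the function name is my invention:

```c
#include <stdint.h>

/* Given the block number reported in the shared hash table and a
 * per-block "is it in the buffer cache" predicate, find how far back
 * the cached tail of the previous scan extends.  The new scan can read
 * [result, reported) straight from cache before continuing forwards
 * from 'reported', instead of re-reading it from disk later. */
uint32_t choose_start_block(uint32_t reported, const int *cached)
{
    uint32_t blk = reported;
    while (blk > 0 && cached[blk - 1])
        blk--;                       /* previous page still cached */
    return blk;                      /* first block of the cached run */
}
```

For scenario B (table slightly larger than shared_buffers), this would let a new scan consume the still-resident tail of the previous scan first, rather than starting cold.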
3. By having different backends doing the reads, are we destroying OS
readahead as Tom suggested? I remember you performed some tests on that,
and it was a problem on some systems but not on others. This needs some
thought, there may be some simple way to address that.
4. It fails regression tests. You get an assertion failure on the portal
test. I believe that changing the direction of a scan isn't handled
properly; it's probably pretty easy to fix.
Jeff, could you please fix 1 and 4? I'll give 2 and 3 some more thought,
and take a closer look at the scan-resistant scans patch. Any comments
and ideas are welcome, of course.