On 3 May 2017 at 07:13, Robert Haas <robertmh...@gmail.com> wrote:
> It is of course possible that the Parallel Seq Scan could run into
> contention problems if the number of workers is large, but in my
> experience there are bigger problems here.  The non-parallel Seq Scan
> can also contend -- not of course over the shared mutex because there
> isn't one, but over access to the blocks themselves.  Every one of
> those blocks has a content lock and a buffer header and so on, and
> having multiple processes accessing those things at the same time
> scales well, but not perfectly.  The Hash node can also contend: if
> the hash join spills to disk, you've got multiple processes reading
> and writing to the temp directory at the same time and, of course,
> that can be worse than just one process doing it -- sometimes much
> worse.  It can also be better, depending on how much I/O gets
> generated and how much I/O bandwidth you have.

Yeah, I did get some time to look over the contention in Parallel Seq
Scan a while back, and I discovered that on the machine I was testing
on, the lock obtained in heap_parallelscan_nextpage() was causing
workers to have to wait for other workers to fetch their next page to
work on.

I ended up writing the attached (which I'd not intended to post until
some time closer to when the doors open for PG11). At the moment it's
basically just a test patch to see how it affects things when we give
workers a bit more to do before they come back to look for more work.
In this case, I've just given them 10 pages to work on, instead of the
1 that's allocated in 9.6 and v10.

A quick test on a pretty large table on a large machine shows:

Unpatched:

postgres=# select count(*) from a;
(1 row)

Time: 5211.485 ms (00:05.211)

Patched:

postgres=# select count(*) from a;
(1 row)

Time: 2523.983 ms (00:02.524)

So it seems worth looking into. "a" was just a table with a single int
column. I'm unsure as yet if there are more gains to be had for tables
with wider tuples. I do suspect the patch will be a bigger win in
those cases, since there's less work to do for each page, e.g. fewer
aggregate advance calls, so workers will likely come looking for their
next page a bit sooner.

Now, I'm not going to pretend that this patch is ready for prime time.
I've not yet worked out how to properly report sync-scan locations
without risking reporting later pages after reporting the end of the
scan. What I have at the moment could cause a report to be missed if
SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch size. I'm also
not sure how batching like this affects read-ahead, but at least the
numbers above speak for something, although none of the pages in this
case came from disk.

I'd had thoughts that the 10 pages wouldn't be constant; instead, the
batch size would depend on the size of the relation to be scanned. My
rough idea was to aim for about 1 million batches, something like
batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1); so that we
only take more than 1 page at a time if there's some decent amount to
process. We don't want to make the batches too big, as we might end up
having to wait on slow workers at the end of a scan.

Anyway, I don't want to hijack this thread with discussions on this.
I just wanted to mention that I plan to work on it, in order to avoid
any duplicate development or analysis. I'll probably start a new
thread for this sometime nearer PG11's dev cycle.

The patch is attached if in the meantime someone wants to run this on
some big hardware.

 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: parallel_nextpage_batching_v2.patch
Description: Binary data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)