On 3 May 2017 at 07:13, Robert Haas <robertmh...@gmail.com> wrote:
> It is of course possible that the Parallel Seq Scan could run into
> contention problems if the number of workers is large, but in my
> experience there are bigger problems here. The non-parallel Seq Scan
> can also contend -- not of course over the shared mutex because there
> isn't one, but over access to the blocks themselves. Every one of
> those blocks has a content lock and a buffer header and so on, and
> having multiple processes accessing those things at the same time
> scales well, but not perfectly. The Hash node can also contend: if
> the hash join spills to disk, you've got multiple processes reading
> and writing to the temp directory at the same time and, of course,
> that can be worse than just one process doing it -- sometimes much
> worse. It can also be better, depending on how much I/O gets
> generated and how much I/O bandwidth you have.
Yeah, I did get some time to look over the contention in Parallel Seq Scan a while back, and I discovered that, on the machine I was testing on, the lock obtained in heap_parallelscan_nextpage() was causing workers to have to wait for other workers to fetch their next task to work on.

I ended up writing the attached (which I'd not intended to post until some time closer to when the doors open for PG11). At the moment it's basically just a test patch to see how it affects things when we give workers a bit more to do before they come back to look for more work. In this case, I've just given them 10 pages to work on, instead of the 1 that's allocated in 9.6 and v10.

A quick test on a pretty large table on a large machine shows:

Unpatched:

postgres=# select count(*) from a;
   count
------------
 1874000000
(1 row)

Time: 5211.485 ms (00:05.211)

Patched:

postgres=# select count(*) from a;
   count
------------
 1874000000
(1 row)

Time: 2523.983 ms (00:02.524)

So it seems worth looking into. "a" was just a table with a single int column. I'm unsure as yet if there are more gains to be had for tables with wider tuples. I do suspect the patch will be a bigger win in those cases, since there's less work to do for each page, e.g. fewer advance aggregate calls, so likely they'll be looking for their next page a bit sooner.

Now, I'm not going to pretend that this patch is ready for prime time. I've not yet worked out how to properly report sync-scan locations without risking reporting later pages after reporting the end of the scan. What I have at the moment could cause a report to be missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch size. I'm also not sure how batching like this affects read-ahead, but at least the numbers above speak for something, although none of the pages in this case came from disk.

I'd had thoughts that the 10 pages wouldn't be constant; instead, the batch size would depend on the size of the relation to be scanned.
I'd had rough ideas to just try to make about 1 million batches. Something like:

batch_pages = Max(parallel_scan->phs_nblocks / 1000000, 1);

so that we only take more than 1 page if there's some decent amount to process. We don't want to make the batches too big, as we might end up having to wait on slow workers at the end of a scan.

Anyway, I don't want to hijack this thread with discussions on this. I just wanted to mark that I plan to work on this in order to avoid any repeat developments or analysis. I'll probably start a new thread for this sometime nearer PG11's dev cycle. The patch is attached if in the meantime someone wants to run this on some big hardware.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers