Hi,

A while back I did some benchmarking on a big 4-socket machine to explore the outer limits of parallel aggregates. Along the way I discovered that, given enough workers and a simple enough task, seq-scan workers were held up waiting for the lock to be released in heap_parallelscan_nextpage().
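For a bit of context, the serialization comes from every worker having to take the same lock just to learn which block it should scan next. Here's a simplified, self-contained illustration of that pattern, with hypothetical names; it is not the actual PostgreSQL code, which also has to deal with synchronous scan positions and wraparound:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

typedef struct LockedScanState
{
    pthread_mutex_t mutex;      /* initialise with PTHREAD_MUTEX_INITIALIZER */
    BlockNumber     cblock;     /* next block to hand out */
    BlockNumber     nblocks;    /* relation size in blocks */
} LockedScanState;

/* Every worker calls this for every page, so they all queue on one lock. */
static bool
claim_next_block_locked(LockedScanState *scan, BlockNumber *block)
{
    bool ok = false;

    pthread_mutex_lock(&scan->mutex);
    if (scan->cblock < scan->nblocks)
    {
        *block = scan->cblock++;
        ok = true;
    }
    pthread_mutex_unlock(&scan->mutex);
    return ok;
}

With a simple enough scan, each worker spends so little time per page that this critical section becomes the bottleneck.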
I've since done a little work in this area to try to improve things, and ended up posting about it yesterday in [1]. My original patch used batching to solve the issue: instead of allocating one block at a time, it allocated a range of 10 blocks for each worker to process. However, the implementation still needed a bit of work around reporting sync-scan locations.

Andres mentioned in [2] that it might be worth exploring using atomics to do the same job, so I went ahead and did that, and came up with the attached, which is a slight variation on what he mentioned in the thread. To keep things simpler and more streamlined, I pulled the logic for setting the start block out into another function, which we call just once, before the first call to heap_parallelscan_nextpage(). I also removed phs_cblock and replaced it with a counter that always starts at zero; the actual block is calculated as (startblock + counter) modulo nblocks. This makes it much simpler to detect when all the blocks have been allocated to workers, and it also handles wrapping back to the start of the relation when the scan began somewhere in the middle due to piggybacking on a synchronous scan. (A standalone sketch of this scheme appears at the end of this mail.)

Performance: with parallel_workers=71, it looks something like this.

Query 1: 881 GB, ~6 billion row TPC-H lineitem table.

tpch=# select count(*) from lineitem;
   count
------------
 5999989709
(1 row)

-- Master
Time: 123421.283 ms (02:03.421)
Time: 118895.846 ms (01:58.896)
Time: 118632.546 ms (01:58.633)

-- Atomics patch
Time: 74038.813 ms (01:14.039)
Time: 73166.200 ms (01:13.166)
Time: 72492.338 ms (01:12.492)

-- Batching patch: batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 76364.215 ms (01:16.364)
Time: 75808.900 ms (01:15.809)
Time: 74927.756 ms (01:14.928)

Query 2: Single int column table with 2 billion rows.

tpch=# select count(*) from a;
   count
------------
 2000000000
(1 row)

-- Master
Time: 5853.918 ms (00:05.854)
Time: 5925.633 ms (00:05.926)
Time: 5859.223 ms (00:05.859)

-- Atomics patch
Time: 5825.745 ms (00:05.826)
Time: 5849.139 ms (00:05.849)
Time: 5815.818 ms (00:05.816)

-- Batching patch: batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 5789.237 ms (00:05.789)
Time: 5837.395 ms (00:05.837)
Time: 5821.492 ms (00:05.821)

I've also attached a text file with the perf report for the lineitem query. You'll notice that heap_parallelscan_nextpage() is very visible on master, but not with either of the two patches.

With the second query, heap_parallelscan_nextpage() is fairly insignificant in master's profile, showing up at only 0.48%. That's likely because more tuples are read from each page, so more aggregation work gets done before the next page is needed. I'm uncertain why I previously saw a speedup in this case in [1].

I've also noticed that both the atomics patch and unpatched master do something that looks a bit odd with synchronous seq-scans: if the parallel seq-scan piggybacked on another scan, then subsequent parallel scans will start at the same non-zero block location, even when no other concurrent scans exist. I'd have expected this to go back to block 0 again, but maybe I'm just failing to understand the reason for reporting the start block to ss_report_location() at the end of the scan.

I'll now add this to the first commitfest of PG11. I just wanted to note that I've done this, so that it's less likely someone else goes and repeats the same work.
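To make the scheme described above concrete, here's a standalone sketch of the allocation logic, using C11 atomics rather than PostgreSQL's pg_atomic_* wrappers. The names and structure are illustrative only, not those of the attached patch; the point is that claiming a block costs a single fetch-add, and the (startblock + counter) modulo nblocks arithmetic gives the synchronous-scan wraparound for free:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

typedef struct SharedScanState
{
    BlockNumber startblock;          /* set once, before the first claim; may
                                      * be non-zero when piggybacking on a
                                      * synchronous scan */
    BlockNumber nblocks;             /* relation size in blocks */
    atomic_uint_fast64_t nallocated; /* blocks handed out so far; starts at 0 */
} SharedScanState;

/*
 * Claim the next block, or return false once every block has been handed
 * out.  The counter may overshoot nblocks when several workers race past
 * the end, which is harmless: they all just see "done".
 */
static bool
claim_next_block(SharedScanState *scan, BlockNumber *block)
{
    uint64_t i = atomic_fetch_add(&scan->nallocated, 1);

    if (i >= scan->nblocks)
        return false;
    *block = (BlockNumber) ((scan->startblock + i) % scan->nblocks);
    return true;
}

int
main(void)
{
    /* A 10-block relation with a scan that piggybacked at block 7. */
    SharedScanState scan = { .startblock = 7, .nblocks = 10 };
    atomic_init(&scan.nallocated, 0);

    BlockNumber block;
    while (claim_next_block(&scan, &block))
        printf("%u ", block);        /* prints: 7 8 9 0 1 2 3 4 5 6 */
    printf("\n");
    return 0;
}

Since the counter starts at zero regardless of the start block, "have we allocated everything?" is a plain comparison against nblocks, which is what makes end-of-scan detection simpler than tracking a wrapping current-block cursor.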
[1] https://www.postgresql.org/message-id/CAKJS1f-XhfQ2-%3D85wgYo5b3WtEs%3Dys%3D2Rsq%3DNuvnmaV4ZsM1XQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20170505023646.3uhnmf2hbwtm63lc%40alap3.anarazel.de

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
-- Unpatched
select count(*) from lineitem; 71 workers

Children     Self  Command   Shared Object       Symbol
+ 99.83%   0.00%  postgres  libpthread-2.17.so  [.] __restore_rt
+ 99.83%   0.00%  postgres  postgres            [.] sigusr1_handler
+ 99.83%   0.00%  postgres  postgres            [.] maybe_start_bgworkers
+ 99.83%   0.00%  postgres  postgres            [.] do_start_bgworker
+ 99.83%   0.93%  postgres  postgres            [.] ExecProcNode
+ 99.83%   0.00%  postgres  postgres            [.] StartBackgroundWorker
+ 99.83%   0.00%  postgres  postgres            [.] ParallelWorkerMain
+ 99.83%   0.00%  postgres  postgres            [.] ParallelQueryMain
+ 99.83%   0.00%  postgres  postgres            [.] ExecutorRun
+ 99.83%   0.00%  postgres  postgres            [.] standard_ExecutorRun
+ 99.83%   0.00%  postgres  postgres            [.] ExecutePlan
+ 99.83%   0.00%  postgres  postgres            [.] ExecAgg
+ 99.69%   0.62%  postgres  postgres            [.] agg_retrieve_direct
+ 91.28%   0.63%  postgres  postgres            [.] fetch_input_tuple
+ 89.75%   0.17%  postgres  postgres            [.] ExecSeqScan
+ 89.41%   1.07%  postgres  postgres            [.] ExecScan
+ 84.46%   0.24%  postgres  postgres            [.] ExecScanFetch
+ 84.26%   0.54%  postgres  postgres            [.] SeqNext
+ 83.05%   0.44%  postgres  postgres            [.] heap_getnext
+ 82.61%   2.00%  postgres  postgres            [.] heapgettup_pagemode
+ 57.81%   0.84%  postgres  postgres            [.] heap_parallelscan_nextpage
+ 53.46%  43.38%  postgres  postgres            [.] s_lock
+ 21.82%   1.24%  postgres  postgres            [.] heapgetpage
+ 20.69%   1.16%  postgres  postgres            [.] ReadBufferExtended
+ 19.52%   0.04%  postgres  postgres            [.] ReadBuffer_common
+ 13.75%   0.00%  postgres  [kernel.kallsyms]   [k] system_call_fastpath
+ 11.54%   0.08%  postgres  postgres            [.] BufferAlloc
+  8.21%   8.16%  postgres  postgres            [.] tas
+  7.71%   0.03%  postgres  postgres            [.] smgrread
+  7.68%   0.02%  postgres  postgres            [.] mdread
+  7.22%   1.84%  postgres  postgres            [.] advance_aggregates
+  7.21%   0.03%  postgres  libpthread-2.17.so  [.] __read_nocancel
+  7.05%   0.02%  postgres  [kernel.kallsyms]   [k] sys_read
+  7.01%   0.02%  postgres  [kernel.kallsyms]   [k] vfs_read
+  6.88%   0.04%  postgres  [kernel.kallsyms]   [k] do_sync_read
+  6.85%   0.08%  postgres  [xfs]               [k] xfs_file_aio_read
+  6.39%   0.01%  postgres  [kernel.kallsyms]   [k] sys_futex
+  6.39%   0.01%  postgres  [kernel.kallsyms]   [k] do_futex
+  6.27%   0.07%  postgres  [kernel.kallsyms]   [k] generic_file_aio_read
+  5.82%   0.11%  postgres  [kernel.kallsyms]   [k] get_futex_key
+  5.15%   0.07%  postgres  postgres            [.] LWLockAcquire
+  4.82%   4.82%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+  4.52%   4.52%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
+  3.92%   0.09%  postgres  libpthread-2.17.so  [.] sem_wait
+  3.80%   0.02%  postgres  [kernel.kallsyms]   [k] futex_wait
+  3.61%   0.02%  postgres  [kernel.kallsyms]   [k] futex_wait_setup
+  3.34%   1.49%  postgres  postgres            [.] ExecProject
+  3.31%   3.29%  postgres  postgres            [.] tas
+  3.27%   0.10%  postgres  postgres            [.] LWLockRelease
+  3.04%   0.01%  postgres  [kernel.kallsyms]   [k] __lock_page
+  3.02%   0.09%  postgres  [kernel.kallsyms]   [k] __wait_on_bit_lock
+  3.01%   1.09%  postgres  postgres            [.] ExecProject
+  2.79%   0.06%  postgres  postgres            [.] LWLockWakeup
+  2.67%   0.06%  postgres  libpthread-2.17.so  [.] sem_post
+  2.58%   0.04%  postgres  [kernel.kallsyms]   [k] futex_wake
+  2.35%   0.02%  postgres  [kernel.kallsyms]   [k] unlock_page
+  2.31%   0.05%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+  2.26%   0.01%  postgres  [kernel.kallsyms]   [k] __wake_up
+  2.25%   0.09%  postgres  [kernel.kallsyms]   [k] prepare_to_wait_exclusive
+  2.23%   1.48%  postgres  postgres            [.] hash_search_with_hash_value
+  2.01%   1.01%  postgres  postgres            [.] advance_transition_function
+  1.85%   1.83%  postgres  postgres            [.] ExecClearTuple
+  1.28%   0.98%  postgres  postgres            [.] perform_spin_delay
+  1.14%   0.10%  postgres  postgres            [.] LWLockAttemptLock

-- Atomics patch
select count(*) from lineitem; 71 workers

+ 99.58%   0.00%  postgres  postgres            [.] StartBackgroundWorker
+ 99.58%   0.00%  postgres  postgres            [.] ParallelWorkerMain
+ 99.58%   2.22%  postgres  postgres            [.] ExecProcNode
+ 99.58%   0.00%  postgres  postgres            [.] ParallelQueryMain
+ 99.58%   0.00%  postgres  postgres            [.] ExecutorRun
+ 99.58%   0.00%  postgres  postgres            [.] standard_ExecutorRun
+ 99.58%   0.00%  postgres  postgres            [.] ExecutePlan
+ 99.58%   0.00%  postgres  postgres            [.] ExecAgg
+ 99.33%   1.26%  postgres  postgres            [.] agg_retrieve_direct
+ 82.33%   1.41%  postgres  postgres            [.] fetch_input_tuple
+ 78.65%   0.37%  postgres  postgres            [.] ExecSeqScan
+ 77.94%   1.99%  postgres  postgres            [.] ExecScan
+ 69.07%   0.47%  postgres  postgres            [.] ExecScanFetch
+ 68.65%   1.10%  postgres  postgres            [.] SeqNext
+ 66.15%   0.89%  postgres  postgres            [.] heap_getnext
+ 65.28%   4.12%  postgres  postgres            [.] heapgettup_pagemode
+ 60.01%   1.99%  postgres  postgres            [.] heapgetpage
+ 56.58%   0.07%  postgres  postgres            [.] ReadBufferExtended
+ 56.49%   0.11%  postgres  postgres            [.] ReadBuffer_common
+ 44.37%   0.00%  postgres  [kernel.kallsyms]   [k] system_call_fastpath
+ 41.13%   0.18%  postgres  postgres            [.] BufferAlloc
+ 30.34%   0.02%  postgres  [kernel.kallsyms]   [k] sys_futex
+ 30.31%   0.02%  postgres  [kernel.kallsyms]   [k] do_futex
+ 28.19%   0.50%  postgres  [kernel.kallsyms]   [k] get_futex_key
+ 21.93%  21.93%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
+ 21.42%   0.15%  postgres  postgres            [.] LWLockAcquire
+ 18.82%   0.33%  postgres  libpthread-2.17.so  [.] sem_wait
+ 18.36%   0.08%  postgres  [kernel.kallsyms]   [k] futex_wait
+ 17.67%   0.05%  postgres  [kernel.kallsyms]   [k] futex_wait_setup
+ 15.32%   0.05%  postgres  [kernel.kallsyms]   [k] __lock_page
+ 15.24%   0.44%  postgres  [kernel.kallsyms]   [k] __wait_on_bit_lock
+ 14.76%   0.06%  postgres  postgres            [.] smgrread
+ 14.70%   0.07%  postgres  postgres            [.] mdread
+ 14.63%   3.64%  postgres  postgres            [.] advance_aggregates
+ 13.69%   0.05%  postgres  libpthread-2.17.so  [.] __read_nocancel
+ 13.60%   0.22%  postgres  postgres            [.] LWLockRelease
+ 13.38%   0.05%  postgres  [kernel.kallsyms]   [k] sys_read
+ 13.27%   0.06%  postgres  [kernel.kallsyms]   [k] vfs_read
+ 13.01%   0.08%  postgres  [kernel.kallsyms]   [k] do_sync_read
+ 12.94%   0.13%  postgres  [xfs]               [k] xfs_file_aio_read
+ 12.68%   0.21%  postgres  postgres            [.] LWLockWakeup
+ 12.27%   0.21%  postgres  libpthread-2.17.so  [.] sem_post
+ 11.93%   0.14%  postgres  [kernel.kallsyms]   [k] futex_wake
+ 11.87%   0.13%  postgres  [kernel.kallsyms]   [k] generic_file_aio_read
+ 11.53%   0.48%  postgres  [kernel.kallsyms]   [k] prepare_to_wait_exclusive
+ 11.17%   0.07%  postgres  [kernel.kallsyms]   [k] unlock_page
+ 11.06%   0.38%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+ 10.66%   0.02%  postgres  [kernel.kallsyms]   [k] __wake_up
+  9.02%   9.02%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+  6.13%   2.10%  postgres  postgres            [.] ExecProject
+  5.87%   2.14%  postgres  postgres            [.] ExecProject
+  4.41%   2.95%  postgres  postgres            [.] hash_search_with_hash_value
+  4.16%   2.17%  postgres  postgres            [.] advance_transition_function
+  3.59%   3.56%  postgres  postgres            [.] ExecClearTuple
+  2.38%   0.01%  postgres  [kernel.kallsyms]   [k] finish_wait
+  2.33%   0.78%  postgres  postgres            [.] ExecEvalExprSwitchContext
+  2.16%   0.21%  postgres  postgres            [.] LWLockAttemptLock
+  2.03%   2.01%  postgres  postgres            [.] ExecInterpExpr
+  1.95%   0.64%  postgres  postgres            [.] ExecEvalExprSwitchContext
+  1.83%   1.82%  postgres  postgres            [.] MemoryContextReset
+  1.76%   0.02%  postgres  postgres            [.] BufTableLookup
+  1.75%   0.91%  postgres  [kernel.kallsyms]   [k] __wake_up_common
+  1.73%   1.72%  postgres  postgres            [.] int8inc
+  1.64%   1.01%  postgres  [kernel.kallsyms]   [k] __find_get_page

-- More details on that spin lock.
- 21.93%  21.93%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 48.37% prepare_to_wait_exclusive
           __wait_on_bit_lock
           __lock_page
         + get_futex_key
      + 39.15% __wake_up
      + 10.37% finish_wait
      + 1.18% pagevec_lru_move_fn
      + 0.77% try_to_wake_up

-- Batch patch, 10 pages at a time, 71 workers.
+ 99.45%   1.91%  postgres  postgres            [.] ExecProcNode
+ 99.45%   0.00%  postgres  postgres            [.] ParallelWorkerMain
+ 99.45%   0.00%  postgres  postgres            [.] ParallelQueryMain
+ 99.45%   0.00%  postgres  postgres            [.] ExecutorRun
+ 99.45%   0.00%  postgres  postgres            [.] standard_ExecutorRun
+ 99.45%   0.00%  postgres  postgres            [.] ExecutePlan
+ 99.45%   0.00%  postgres  postgres            [.] ExecAgg
+ 99.10%   1.18%  postgres  postgres            [.] agg_retrieve_direct
+ 81.94%   1.42%  postgres  postgres            [.] fetch_input_tuple
+ 78.62%   0.35%  postgres  postgres            [.] ExecSeqScan
+ 77.88%   2.02%  postgres  postgres            [.] ExecScan
+ 69.05%   0.53%  postgres  postgres            [.] ExecScanFetch
+ 68.55%   1.17%  postgres  postgres            [.] SeqNext
+ 66.03%   0.93%  postgres  postgres            [.] heap_getnext
+ 65.10%   4.06%  postgres  postgres            [.] heapgettup_pagemode
+ 60.60%   2.02%  postgres  postgres            [.] heapgetpage
+ 57.07%   0.09%  postgres  postgres            [.] ReadBufferExtended
+ 56.96%   0.10%  postgres  postgres            [.] ReadBuffer_common
+ 45.67%   0.01%  postgres  [kernel.kallsyms]   [k] system_call_fastpath
+ 42.58%   0.19%  postgres  postgres            [.] BufferAlloc
+ 32.18%   0.02%  postgres  [kernel.kallsyms]   [k] sys_futex
+ 32.16%   0.02%  postgres  [kernel.kallsyms]   [k] do_futex
+ 30.04%   0.28%  postgres  [kernel.kallsyms]   [k] get_futex_key
+ 23.99%  23.99%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
+ 22.88%   0.15%  postgres  postgres            [.] LWLockAcquire
+ 20.13%   0.30%  postgres  libpthread-2.17.so  [.] sem_wait
+ 19.71%   0.08%  postgres  [kernel.kallsyms]   [k] futex_wait
+ 19.05%   0.05%  postgres  [kernel.kallsyms]   [k] futex_wait_setup
+ 15.98%   0.06%  postgres  [kernel.kallsyms]   [k] __lock_page
+ 15.90%   0.44%  postgres  [kernel.kallsyms]   [k] __wait_on_bit_lock
+ 14.70%   3.71%  postgres  postgres            [.] advance_aggregates
+ 13.94%   0.21%  postgres  postgres            [.] LWLockRelease
+ 13.84%   0.09%  postgres  postgres            [.] smgrread
+ 13.74%   0.05%  postgres  postgres            [.] mdread
+ 13.42%   0.06%  postgres  libpthread-2.17.so  [.] __read_nocancel
+ 13.09%   0.19%  postgres  postgres            [.] LWLockWakeup
+ 12.96%   0.05%  postgres  [kernel.kallsyms]   [k] sys_read
+ 12.77%   0.08%  postgres  [kernel.kallsyms]   [k] vfs_read
+ 12.71%   0.17%  postgres  libpthread-2.17.so  [.] sem_post
+ 12.50%   0.08%  postgres  [kernel.kallsyms]   [k] do_sync_read
+ 12.44%   0.15%  postgres  [xfs]               [k] xfs_file_aio_read
+ 12.42%   0.15%  postgres  [kernel.kallsyms]   [k] futex_wake
+ 12.42%   0.10%  postgres  [kernel.kallsyms]   [k] unlock_page
+ 12.29%   0.42%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+ 12.28%   0.48%  postgres  [kernel.kallsyms]   [k] prepare_to_wait_exclusive
+ 11.85%   0.02%  postgres  [kernel.kallsyms]   [k] __wake_up
+ 11.31%   0.14%  postgres  [kernel.kallsyms]   [k] generic_file_aio_read
+  8.83%   8.83%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+  6.25%   1.77%  postgres  postgres            [.] ExecProject
+  5.87%   2.12%  postgres  postgres            [.] ExecProject
+  3.96%   2.63%  postgres  postgres            [.] hash_search_with_hash_value
+  3.88%   2.61%  postgres  postgres            [.] advance_transition_function
+  3.53%   3.51%  postgres  postgres            [.] ExecClearTuple
+  2.81%   2.79%  postgres  postgres            [.] ExecInterpExpr
+  2.79%   0.69%  postgres  postgres            [.] ExecEvalExprSwitchContext
+  2.24%   0.02%  postgres  [kernel.kallsyms]   [k] finish_wait
+  2.16%   0.22%  postgres  postgres            [.] LWLockAttemptLock
+  2.07%   2.06%  postgres  postgres            [.] MemoryContextReset
+  2.02%   0.63%  postgres  postgres            [.] ExecEvalExprSwitchContext
+  1.59%   0.80%  postgres  [kernel.kallsyms]   [k] __wake_up_common
+  1.55%   1.54%  postgres  postgres            [.] pg_atomic_read_u32_impl
+  1.52%   0.04%  postgres  postgres            [.] pg_atomic_read_u32
+  1.51%   0.02%  postgres  postgres            [.] BufTableLookup
+  1.45%   0.07%  postgres  postgres            [.] LockBufHdr
Attachment: parallel_nextpage_atomics.patch