Hi,

A while back I did some benchmarking on a big 4-socket machine to
explore a bit around the outer limits of parallel aggregates.  I
discovered along the way that, given enough workers and a simple
enough task, seq-scan workers were held up waiting for the lock
to be released in heap_parallelscan_nextpage().

I've since done a little work in this area to try to improve things. I
ended up posting about it yesterday in [1]. My original patch used
batching to solve the issue; instead of allocating 1 block at a time,
the batching patch allocated a range of 10 blocks for the worker to
process. However, the implementation still needed a bit of work around
reporting sync-scan locations.
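For illustration, the batching idea boils down to something like the
following standalone sketch (BATCH_SIZE and all names here are
illustrative only, not what the actual patch uses); the point is that a
worker only takes the shared lock once per BATCH_SIZE blocks rather
than once per block:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BATCH_SIZE 10           /* blocks claimed per lock acquisition */

/* Hypothetical per-worker batch state. */
typedef struct
{
    uint32_t    next;           /* next block in our current batch */
    uint32_t    end;            /* one past the last block in the batch */
} WorkerBatch;

/*
 * Return true and set *block while blocks remain in the local batch.
 * Only when it returns false would the worker go back to the shared
 * scan state (under the lock) to claim another BATCH_SIZE blocks,
 * cutting lock traffic by roughly a factor of BATCH_SIZE.
 */
static bool
batch_next(WorkerBatch *wb, uint32_t *block)
{
    if (wb->next >= wb->end)
        return false;           /* batch exhausted; claim a new one */
    *block = wb->next++;
    return true;
}
```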

Andres mentioned in [2] that it might be worth exploring using atomics
to do the same job. So I went ahead and did that, and came up with the
attached, which is a slight variation on what he mentioned in the
thread.

To keep things a bit simpler and more streamlined, I ended up pulling
the logic for setting the start block out into another function, which
we call only once, before the first call to
heap_parallelscan_nextpage().  I also ended up removing phs_cblock and
replacing it with a counter that always starts at zero; the actual
block is calculated as that counter plus the start block, modulo
nblocks.  This makes it a good bit simpler to detect when we've
allocated all the blocks to the workers, and it also works nicely when
wrapping back to the start of the relation after having started
somewhere in the middle due to piggybacking on a synchronous scan.
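Roughly, the scheme can be sketched like this (a simplified standalone
sketch using C11 atomics rather than PostgreSQL's pg_atomic API; the
struct and field names are illustrative, not the actual patch's).  The
lock is replaced by a single atomic fetch-add on a zero-based counter,
and the modulo handles the sync-scan wraparound:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)

/* Hypothetical simplified shared parallel-scan state. */
typedef struct
{
    uint32_t    phs_nblocks;        /* total blocks in the relation */
    uint32_t    phs_startblock;     /* where the scan began (sync scans) */
    _Atomic uint64_t phs_nallocated;    /* blocks handed out so far */
} ParallelScanState;

/*
 * Hand the next block to a worker, or InvalidBlockNumber once every
 * block has been allocated.  The atomic fetch-add replaces the
 * lock-protected increment; because the counter always starts at zero,
 * "have we allocated everything?" is a simple comparison against
 * nblocks, and the modulo wraps us back past the end of the relation
 * when the scan started in the middle.
 */
static uint32_t
next_block(ParallelScanState *ps)
{
    uint64_t    allocated = atomic_fetch_add(&ps->phs_nallocated, 1);

    if (allocated >= ps->phs_nblocks)
        return InvalidBlockNumber;  /* all blocks handed out */

    return (uint32_t) ((ps->phs_startblock + allocated) % ps->phs_nblocks);
}
```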

Performance:

With parallel_workers=71, it looks something like:

Query 1: 881 GB, ~6 billion row TPC-H lineitem table.

tpch=# select count(*) from lineitem;
   count
------------
 5999989709
(1 row)

-- Master
Time: 123421.283 ms (02:03.421)
Time: 118895.846 ms (01:58.896)
Time: 118632.546 ms (01:58.633)

-- Atomics patch
Time: 74038.813 ms (01:14.039)
Time: 73166.200 ms (01:13.166)
Time: 72492.338 ms (01:12.492)

-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 76364.215 ms (01:16.364)
Time: 75808.900 ms (01:15.809)
Time: 74927.756 ms (01:14.928)

Query 2: Single int column table with 2 billion rows.

tpch=# select count(*) from a;
   count
------------
 2000000000
(1 row)

-- Master
Time: 5853.918 ms (00:05.854)
Time: 5925.633 ms (00:05.926)
Time: 5859.223 ms (00:05.859)

-- Atomics patch
Time: 5825.745 ms (00:05.826)
Time: 5849.139 ms (00:05.849)
Time: 5815.818 ms (00:05.816)

-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 5789.237 ms (00:05.789)
Time: 5837.395 ms (00:05.837)
Time: 5821.492 ms (00:05.821)

I've also attached a text file with the perf report for the lineitem
query. You'll notice that heap_parallelscan_nextpage() is very
visible in master's profile, but not in either of the two patches'.

With the 2nd query, heap_parallelscan_nextpage() is fairly
insignificant even in master's profile, showing up at only 0.48%.
Likely this is because more tuples are read from each page, so more
aggregation work gets done before the next page is needed. I'm
uncertain why I previously saw a speed-up in this case in [1].

I've also noticed that both the atomics patch and unpatched master do
something that looks a bit weird with synchronous seq-scans. If the
parallel seq-scan piggybacked on another scan, then subsequent
parallel scans will start at the same non-zero block location, even
when no other concurrent scans exist. I'd have expected this to go
back to block 0 again, but maybe I'm just failing to understand the
reason for reporting the start block to ss_report_location() at the
end of the scan.

I'll now add this to the first commitfest of pg11. I just wanted to
note that I've done this, so that it's less likely someone else goes
and repeats the same work.

[1] 
https://www.postgresql.org/message-id/CAKJS1f-XhfQ2-%3D85wgYo5b3WtEs%3Dys%3D2Rsq%3DNuvnmaV4ZsM1XQ%40mail.gmail.com

[2] 
https://www.postgresql.org/message-id/20170505023646.3uhnmf2hbwtm63lc%40alap3.anarazel.de

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
-- Unpatched select count(*) from lineitem; 71 workers
  Children      Self  Command   Shared Object         Symbol
+   99.83%     0.00%  postgres  libpthread-2.17.so    [.] __restore_rt
+   99.83%     0.00%  postgres  postgres              [.] sigusr1_handler
+   99.83%     0.00%  postgres  postgres              [.] maybe_start_bgworkers
+   99.83%     0.00%  postgres  postgres              [.] do_start_bgworker
+   99.83%     0.93%  postgres  postgres              [.] ExecProcNode
+   99.83%     0.00%  postgres  postgres              [.] StartBackgroundWorker
+   99.83%     0.00%  postgres  postgres              [.] ParallelWorkerMain
+   99.83%     0.00%  postgres  postgres              [.] ParallelQueryMain
+   99.83%     0.00%  postgres  postgres              [.] ExecutorRun
+   99.83%     0.00%  postgres  postgres              [.] standard_ExecutorRun
+   99.83%     0.00%  postgres  postgres              [.] ExecutePlan
+   99.83%     0.00%  postgres  postgres              [.] ExecAgg
+   99.69%     0.62%  postgres  postgres              [.] agg_retrieve_direct
+   91.28%     0.63%  postgres  postgres              [.] fetch_input_tuple
+   89.75%     0.17%  postgres  postgres              [.] ExecSeqScan
+   89.41%     1.07%  postgres  postgres              [.] ExecScan
+   84.46%     0.24%  postgres  postgres              [.] ExecScanFetch
+   84.26%     0.54%  postgres  postgres              [.] SeqNext
+   83.05%     0.44%  postgres  postgres              [.] heap_getnext
+   82.61%     2.00%  postgres  postgres              [.] heapgettup_pagemode
+   57.81%     0.84%  postgres  postgres              [.] heap_parallelscan_nextpage
+   53.46%    43.38%  postgres  postgres              [.] s_lock
+   21.82%     1.24%  postgres  postgres              [.] heapgetpage
+   20.69%     1.16%  postgres  postgres              [.] ReadBufferExtended
+   19.52%     0.04%  postgres  postgres              [.] ReadBuffer_common
+   13.75%     0.00%  postgres  [kernel.kallsyms]     [k] system_call_fastpath
+   11.54%     0.08%  postgres  postgres              [.] BufferAlloc
+    8.21%     8.16%  postgres  postgres              [.] tas
+    7.71%     0.03%  postgres  postgres              [.] smgrread
+    7.68%     0.02%  postgres  postgres              [.] mdread
+    7.22%     1.84%  postgres  postgres              [.] advance_aggregates
+    7.21%     0.03%  postgres  libpthread-2.17.so    [.] __read_nocancel
+    7.05%     0.02%  postgres  [kernel.kallsyms]     [k] sys_read
+    7.01%     0.02%  postgres  [kernel.kallsyms]     [k] vfs_read
+    6.88%     0.04%  postgres  [kernel.kallsyms]     [k] do_sync_read
+    6.85%     0.08%  postgres  [xfs]                 [k] xfs_file_aio_read
+    6.39%     0.01%  postgres  [kernel.kallsyms]     [k] sys_futex
+    6.39%     0.01%  postgres  [kernel.kallsyms]     [k] do_futex
+    6.27%     0.07%  postgres  [kernel.kallsyms]     [k] generic_file_aio_read
+    5.82%     0.11%  postgres  [kernel.kallsyms]     [k] get_futex_key
+    5.15%     0.07%  postgres  postgres              [.] LWLockAcquire
+    4.82%     4.82%  postgres  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
+    4.52%     4.52%  postgres  [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
+    3.92%     0.09%  postgres  libpthread-2.17.so    [.] sem_wait
+    3.80%     0.02%  postgres  [kernel.kallsyms]     [k] futex_wait
+    3.61%     0.02%  postgres  [kernel.kallsyms]     [k] futex_wait_setup
+    3.34%     1.49%  postgres  postgres              [.] ExecProject
+    3.31%     3.29%  postgres  postgres              [.] tas
+    3.27%     0.10%  postgres  postgres              [.] LWLockRelease
+    3.04%     0.01%  postgres  [kernel.kallsyms]     [k] __lock_page
+    3.02%     0.09%  postgres  [kernel.kallsyms]     [k] __wait_on_bit_lock
+    3.01%     1.09%  postgres  postgres              [.] ExecProject
+    2.79%     0.06%  postgres  postgres              [.] LWLockWakeup
+    2.67%     0.06%  postgres  libpthread-2.17.so    [.] sem_post
+    2.58%     0.04%  postgres  [kernel.kallsyms]     [k] futex_wake
+    2.35%     0.02%  postgres  [kernel.kallsyms]     [k] unlock_page
+    2.31%     0.05%  postgres  [kernel.kallsyms]     [k] __wake_up_bit
+    2.26%     0.01%  postgres  [kernel.kallsyms]     [k] __wake_up
+    2.25%     0.09%  postgres  [kernel.kallsyms]     [k] prepare_to_wait_exclusive
+    2.23%     1.48%  postgres  postgres              [.] hash_search_with_hash_value
+    2.01%     1.01%  postgres  postgres              [.] advance_transition_function
+    1.85%     1.83%  postgres  postgres              [.] ExecClearTuple
+    1.28%     0.98%  postgres  postgres              [.] perform_spin_delay
+    1.14%     0.10%  postgres  postgres              [.] LWLockAttemptLock

-- Atomics patch  select count(*) from lineitem; 71 workers
+   99.58%     0.00%  postgres  postgres            [.] StartBackgroundWorker
+   99.58%     0.00%  postgres  postgres            [.] ParallelWorkerMain
+   99.58%     2.22%  postgres  postgres            [.] ExecProcNode
+   99.58%     0.00%  postgres  postgres            [.] ParallelQueryMain
+   99.58%     0.00%  postgres  postgres            [.] ExecutorRun
+   99.58%     0.00%  postgres  postgres            [.] standard_ExecutorRun
+   99.58%     0.00%  postgres  postgres            [.] ExecutePlan
+   99.58%     0.00%  postgres  postgres            [.] ExecAgg
+   99.33%     1.26%  postgres  postgres            [.] agg_retrieve_direct
+   82.33%     1.41%  postgres  postgres            [.] fetch_input_tuple
+   78.65%     0.37%  postgres  postgres            [.] ExecSeqScan
+   77.94%     1.99%  postgres  postgres            [.] ExecScan
+   69.07%     0.47%  postgres  postgres            [.] ExecScanFetch
+   68.65%     1.10%  postgres  postgres            [.] SeqNext
+   66.15%     0.89%  postgres  postgres            [.] heap_getnext
+   65.28%     4.12%  postgres  postgres            [.] heapgettup_pagemode
+   60.01%     1.99%  postgres  postgres            [.] heapgetpage
+   56.58%     0.07%  postgres  postgres            [.] ReadBufferExtended
+   56.49%     0.11%  postgres  postgres            [.] ReadBuffer_common
+   44.37%     0.00%  postgres  [kernel.kallsyms]   [k] system_call_fastpath
+   41.13%     0.18%  postgres  postgres            [.] BufferAlloc
+   30.34%     0.02%  postgres  [kernel.kallsyms]   [k] sys_futex
+   30.31%     0.02%  postgres  [kernel.kallsyms]   [k] do_futex
+   28.19%     0.50%  postgres  [kernel.kallsyms]   [k] get_futex_key
+   21.93%    21.93%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
+   21.42%     0.15%  postgres  postgres            [.] LWLockAcquire
+   18.82%     0.33%  postgres  libpthread-2.17.so  [.] sem_wait
+   18.36%     0.08%  postgres  [kernel.kallsyms]   [k] futex_wait
+   17.67%     0.05%  postgres  [kernel.kallsyms]   [k] futex_wait_setup
+   15.32%     0.05%  postgres  [kernel.kallsyms]   [k] __lock_page
+   15.24%     0.44%  postgres  [kernel.kallsyms]   [k] __wait_on_bit_lock
+   14.76%     0.06%  postgres  postgres            [.] smgrread
+   14.70%     0.07%  postgres  postgres            [.] mdread
+   14.63%     3.64%  postgres  postgres            [.] advance_aggregates
+   13.69%     0.05%  postgres  libpthread-2.17.so  [.] __read_nocancel
+   13.60%     0.22%  postgres  postgres            [.] LWLockRelease
+   13.38%     0.05%  postgres  [kernel.kallsyms]   [k] sys_read
+   13.27%     0.06%  postgres  [kernel.kallsyms]   [k] vfs_read
+   13.01%     0.08%  postgres  [kernel.kallsyms]   [k] do_sync_read
+   12.94%     0.13%  postgres  [xfs]               [k] xfs_file_aio_read
+   12.68%     0.21%  postgres  postgres            [.] LWLockWakeup
+   12.27%     0.21%  postgres  libpthread-2.17.so  [.] sem_post
+   11.93%     0.14%  postgres  [kernel.kallsyms]   [k] futex_wake
+   11.87%     0.13%  postgres  [kernel.kallsyms]   [k] generic_file_aio_read
+   11.53%     0.48%  postgres  [kernel.kallsyms]   [k] prepare_to_wait_exclusive
+   11.17%     0.07%  postgres  [kernel.kallsyms]   [k] unlock_page
+   11.06%     0.38%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+   10.66%     0.02%  postgres  [kernel.kallsyms]   [k] __wake_up
+    9.02%     9.02%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+    6.13%     2.10%  postgres  postgres            [.] ExecProject
+    5.87%     2.14%  postgres  postgres            [.] ExecProject
+    4.41%     2.95%  postgres  postgres            [.] hash_search_with_hash_value
+    4.16%     2.17%  postgres  postgres            [.] advance_transition_function
+    3.59%     3.56%  postgres  postgres            [.] ExecClearTuple
+    2.38%     0.01%  postgres  [kernel.kallsyms]   [k] finish_wait
+    2.33%     0.78%  postgres  postgres            [.] ExecEvalExprSwitchContext
+    2.16%     0.21%  postgres  postgres            [.] LWLockAttemptLock
+    2.03%     2.01%  postgres  postgres            [.] ExecInterpExpr
+    1.95%     0.64%  postgres  postgres            [.] ExecEvalExprSwitchContext
+    1.83%     1.82%  postgres  postgres            [.] MemoryContextReset
+    1.76%     0.02%  postgres  postgres            [.] BufTableLookup
+    1.75%     0.91%  postgres  [kernel.kallsyms]   [k] __wake_up_common
+    1.73%     1.72%  postgres  postgres            [.] int8inc
+    1.64%     1.01%  postgres  [kernel.kallsyms]   [k] __find_get_page

-- More details on that spin lock.
-   21.93%    21.93%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 48.37% prepare_to_wait_exclusive
           __wait_on_bit_lock
           __lock_page
         + get_futex_key
      + 39.15% __wake_up
      + 10.37% finish_wait
      + 1.18% pagevec_lru_move_fn
      + 0.77% try_to_wake_up

-- Batch patch, 10 pages at a time, 71 workers.
+   99.45%     1.91%  postgres  postgres            [.] ExecProcNode
+   99.45%     0.00%  postgres  postgres            [.] ParallelWorkerMain
+   99.45%     0.00%  postgres  postgres            [.] ParallelQueryMain
+   99.45%     0.00%  postgres  postgres            [.] ExecutorRun
+   99.45%     0.00%  postgres  postgres            [.] standard_ExecutorRun
+   99.45%     0.00%  postgres  postgres            [.] ExecutePlan
+   99.45%     0.00%  postgres  postgres            [.] ExecAgg
+   99.10%     1.18%  postgres  postgres            [.] agg_retrieve_direct
+   81.94%     1.42%  postgres  postgres            [.] fetch_input_tuple
+   78.62%     0.35%  postgres  postgres            [.] ExecSeqScan
+   77.88%     2.02%  postgres  postgres            [.] ExecScan
+   69.05%     0.53%  postgres  postgres            [.] ExecScanFetch
+   68.55%     1.17%  postgres  postgres            [.] SeqNext
+   66.03%     0.93%  postgres  postgres            [.] heap_getnext
+   65.10%     4.06%  postgres  postgres            [.] heapgettup_pagemode
+   60.60%     2.02%  postgres  postgres            [.] heapgetpage
+   57.07%     0.09%  postgres  postgres            [.] ReadBufferExtended
+   56.96%     0.10%  postgres  postgres            [.] ReadBuffer_common
+   45.67%     0.01%  postgres  [kernel.kallsyms]   [k] system_call_fastpath
+   42.58%     0.19%  postgres  postgres            [.] BufferAlloc
+   32.18%     0.02%  postgres  [kernel.kallsyms]   [k] sys_futex
+   32.16%     0.02%  postgres  [kernel.kallsyms]   [k] do_futex
+   30.04%     0.28%  postgres  [kernel.kallsyms]   [k] get_futex_key
+   23.99%    23.99%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
+   22.88%     0.15%  postgres  postgres            [.] LWLockAcquire
+   20.13%     0.30%  postgres  libpthread-2.17.so  [.] sem_wait
+   19.71%     0.08%  postgres  [kernel.kallsyms]   [k] futex_wait
+   19.05%     0.05%  postgres  [kernel.kallsyms]   [k] futex_wait_setup
+   15.98%     0.06%  postgres  [kernel.kallsyms]   [k] __lock_page
+   15.90%     0.44%  postgres  [kernel.kallsyms]   [k] __wait_on_bit_lock
+   14.70%     3.71%  postgres  postgres            [.] advance_aggregates
+   13.94%     0.21%  postgres  postgres            [.] LWLockRelease
+   13.84%     0.09%  postgres  postgres            [.] smgrread
+   13.74%     0.05%  postgres  postgres            [.] mdread
+   13.42%     0.06%  postgres  libpthread-2.17.so  [.] __read_nocancel
+   13.09%     0.19%  postgres  postgres            [.] LWLockWakeup
+   12.96%     0.05%  postgres  [kernel.kallsyms]   [k] sys_read
+   12.77%     0.08%  postgres  [kernel.kallsyms]   [k] vfs_read
+   12.71%     0.17%  postgres  libpthread-2.17.so  [.] sem_post
+   12.50%     0.08%  postgres  [kernel.kallsyms]   [k] do_sync_read
+   12.44%     0.15%  postgres  [xfs]               [k] xfs_file_aio_read
+   12.42%     0.15%  postgres  [kernel.kallsyms]   [k] futex_wake
+   12.42%     0.10%  postgres  [kernel.kallsyms]   [k] unlock_page
+   12.29%     0.42%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+   12.28%     0.48%  postgres  [kernel.kallsyms]   [k] prepare_to_wait_exclusive
+   11.85%     0.02%  postgres  [kernel.kallsyms]   [k] __wake_up
+   11.31%     0.14%  postgres  [kernel.kallsyms]   [k] generic_file_aio_read
+    8.83%     8.83%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+    6.25%     1.77%  postgres  postgres            [.] ExecProject
+    5.87%     2.12%  postgres  postgres            [.] ExecProject
+    3.96%     2.63%  postgres  postgres            [.] hash_search_with_hash_value
+    3.88%     2.61%  postgres  postgres            [.] advance_transition_function
+    3.53%     3.51%  postgres  postgres            [.] ExecClearTuple
+    2.81%     2.79%  postgres  postgres            [.] ExecInterpExpr
+    2.79%     0.69%  postgres  postgres            [.] ExecEvalExprSwitchContext
+    2.24%     0.02%  postgres  [kernel.kallsyms]   [k] finish_wait
+    2.16%     0.22%  postgres  postgres            [.] LWLockAttemptLock
+    2.07%     2.06%  postgres  postgres            [.] MemoryContextReset
+    2.02%     0.63%  postgres  postgres            [.] ExecEvalExprSwitchContext
+    1.59%     0.80%  postgres  [kernel.kallsyms]   [k] __wake_up_common
+    1.55%     1.54%  postgres  postgres            [.] pg_atomic_read_u32_impl
+    1.52%     0.04%  postgres  postgres            [.] pg_atomic_read_u32
+    1.51%     0.02%  postgres  postgres            [.] BufTableLookup
+    1.45%     0.07%  postgres  postgres            [.] LockBufHdr

Attachment: parallel_nextpage_atomics.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
