Robert Haas wrote: > > I don't have a strong opinion on that. Since we know how many tuples > were processed by each worker, knowing how many heap fetches we have > on a per-worker basis seems like a good thing to have, too. On the > other hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that > anyway, since it knows about displaying per-worker instrumentation > (look for /* Show worker detail */). If it doesn't, then that's > probably bad, because I'm pretty sure that when I installed that code > the stuff that got displayed for worker instrumentation pretty much > matched the stuff that got displayed for overall instrumentation.
So, after some experimentation I was able to produce this output with the previously proposed patch: QUERY PLAN ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Gather Merge (cost=1000.51..27083147.02 rows=22950230 width=12) (actual time=1198.906..1198.906 rows=0 loops=1) Output: tbl1.col1, tprt_1.col1, tprt_1.col1 Workers Planned: 5 Workers Launched: 5 Buffers: shared hit=267516 -> Nested Loop (cost=0.43..24318369.79 rows=4590046 width=12) (actual time=1183.291..1183.291 rows=0 loops=6) Output: tbl1.col1, tprt_1.col1, tprt_1.col1 Join Filter: (tbl1.col1 = tprt_1.col1) Rows Removed by Join Filter: 1200012 Buffers: shared hit=1836184 Worker 0: actual time=1174.526..1174.526 rows=0 loops=1 Buffers: shared hit=245599 Worker 1: actual time=1178.845..1178.845 rows=0 loops=1 Buffers: shared hit=270694 Worker 2: actual time=1185.703..1185.703 rows=0 loops=1 Buffers: shared hit=313607 Worker 3: actual time=1189.595..1189.595 rows=0 loops=1 Buffers: shared hit=383714 Worker 4: actual time=1176.965..1176.965 rows=0 loops=1 Buffers: shared hit=355054 -> Parallel Index Only Scan using tprt1_idx on public.tprt_1 (cost=0.43..63100.29 rows=360004 width=4) (actual time=0.041..209.113 rows=300003 loops=6) Output: tprt_1.col1 Heap Fetches: 1800018 Buffers: shared hit=36166 Worker 0: actual time=0.035..192.255 rows=240750 loops=1 Buffers: shared hit=4849 Worker 1: actual time=0.031..184.332 rows=265307 loops=1 Buffers: shared hit=5387 Worker 2: actual time=0.046..272.412 rows=307509 loops=1 Buffers: shared hit=6098 Worker 3: actual time=0.035..212.866 rows=376064 loops=1 Buffers: shared hit=7650 Worker 4: actual time=0.036..201.076 rows=348048 loops=1 Buffers: shared hit=7006 -> Seq Scan on public.tbl1 (cost=0.00..35.50 rows=2550 width=4) (actual time=0.001..0.002 rows=4 loo Time: 1205,533 ms (00:01,206) You're right that per-worker heap-fetches would be probably better, but that seems like a project larger than I'm willing to tackle at this point in the devel cycle -- mainly because that "worker detail" does not have node-specific handling, so there's a bunch of new code to be written there. This amount of detail seems about right for me: I'm not sure that you really care all that much about how many fetches each worker had to do, but rather you care about how many there were in the table as a whole. Another query (without verbose) gives this: QUERY PLAN ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Gather (cost=20000001002.10..20249583898.07 rows=80390828 width=12) (actual time=1554.752..6633.272 rows=3600036 loops=1) Workers Planned: 1 Workers Launched: 1 Single Copy: true -> Nested Loop (cost=20000000002.10..20241543815.27 rows=80390828 width=12) (actual time=1547.552..5832.355 rows=3600036 loops=1) Join Filter: (tbl1.col1 = tprt_1.col1) Rows Removed by Join Filter: 21600216 Buffers: shared hit=75688 read=54346 -> Merge Append (cost=2.10..371288.65 rows=6305163 width=4) (actual time=0.758..1980.024 rows=6300063 loops=1) Sort Key: tprt_1.col1 Buffers: shared hit=75688 read=54345 -> Index Only Scan using tprt1_idx on tprt_1 (cost=0.43..77500.44 rows=1800018 width=4) (actual time=0.211..387.916 rows=1800018 loops=1) Heap Fetches: 1800018 Buffers: shared hit=22028 read=14035 -> Index Only Scan using tprt2_idx on tprt_2 (cost=0.43..121147.34 rows=2700027 width=4) (actual time=0.180..628.350 rows=2700027 loops=1) Heap Fetches: 2700027 Buffers: shared hit=40597 read=26251 -> Index Only Scan using tprt3_idx on tprt_3 (cost=0.42..29722.56 rows=900009 width=4) (actual time=0.153..184.429 rows=900009 loops=1) Heap Fetches: 900009 Buffers: shared hit=6538 read=7029 -> Index Only Scan using tprt4_idx on tprt_4 (cost=0.15..82.41 rows=2550 width=4) (actual time=0.023..0.023 rows=0 loops=1) Heap Fetches: 0 Buffers: shared hit=1 -> Index Only Scan using tprt5_idx on tprt_5 (cost=0.15..82.41 rows=2550 width=4) (actual time=0.018..0.018 rows=0 loops=1) Heap Fetches: 0 Buffers: shared hit=1 -> Index Only Scan using tprt6_idx on tprt_6 (cost=0.42..29734.56 rows=900009 width=4) (actual time=0.165..191.343 rows=900009 loops=1) Heap Fetches: 900009 Buffers: shared hit=6523 read=7030 -> Materialize (cost=10000000000.00..10000000048.25 rows=2550 width=4) (actual time=0.000..0.000 rows=4 loops=6300063) Buffers: shared read=1 -> Seq Scan on tbl1 (cost=10000000000.00..10000000035.50 rows=2550 width=4) (actual time=0.033..0.036 rows=4 loops=1) Buffers: shared read=1 Planning Time: 4.344 ms Execution Time: 7008.095 ms (35 filas) Looks alright to me. > In any case part of the point is that Instrumentation is supposed to > be a generic structure that contains things that are for the most part > common to all node types. So what MERGE did to that structure looks > like in inadvisable kludge to me. I'd get rid of that and do it a > different way rather than propagate the idea that nfilteredX is > scratch space that can mean something different in every separate > node. Agreed. My proposed patch adds another generic counter "tuples2", which can be used at least by ON CONFLICT. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services