Jeff,

Here are the two EXPLAINs for one of my simplest queries:
flows=# SET track_io_timing = on;
LOG:  duration: 24.101 ms  statement: SET track_io_timing = on;
SET
flows=# explain (analyze, timing off) SELECT DISTINCT
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport,
flows-#     COUNT(*) AS conversation,
flows-#     SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-#     flowscompact,
flows-#     mynetworks
flows-# WHERE
flows-#     mynetworks.ipaddr >>= flowscompact.srcaddr
flows-#     AND dstaddr IN
flows-#     (
flows(#         SELECT
flows(#             dstaddr
flows(#         FROM
flows(#             dstexterne
flows(#     )
flows-# GROUP BY
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport
flows-# ORDER BY
flows-#     mbytes DESC LIMIT 50;
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.3", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.4", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.5", size 639696896
LOG:  duration: 2632108.352 ms  statement: explain (analyze, timing off) SELECT DISTINCT srcaddr, dstaddr, dstport, COUNT(*) AS conversation, SUM(doctets) / 1024 / 1024 AS mbytes FROM flowscompact, mynetworks WHERE mynetworks.ipaddr >>= flowscompact.srcaddr AND dstaddr IN ( SELECT dstaddr FROM dstexterne ) GROUP BY srcaddr, dstaddr, dstport ORDER BY mbytes DESC LIMIT 50;
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual rows=50 loops=1)
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual rows=50 loops=1)
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual rows=45499045 loops=10)
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual rows=54155970 loops=10)
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 48679396
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual rows=2 loops=5415597040)
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual rows=4 loops=10)
 Planning time: 62.066 ms
 Execution time: 2631923.716 ms
(33 rows)

flows=# explain (analyze, buffers) SELECT DISTINCT
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport,
flows-#     COUNT(*) AS conversation,
flows-#     SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-#     flowscompact,
flows-#     mynetworks
flows-# WHERE
flows-#     mynetworks.ipaddr >>= flowscompact.srcaddr
flows-#     AND dstaddr IN
flows-#     (
flows(#         SELECT
flows(#             dstaddr
flows(#         FROM
flows(#             dstexterne
flows(#     )
flows-# GROUP BY
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport
flows-# ORDER BY
flows-#     mbytes DESC LIMIT 50;
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896
LOG:  duration: 2765020.327 ms  statement: explain (analyze, buffers) SELECT DISTINCT srcaddr, dstaddr, dstport, COUNT(*) AS conversation, SUM(doctets) / 1024 / 1024 AS mbytes FROM flowscompact, mynetworks WHERE mynetworks.ipaddr >>= flowscompact.srcaddr AND dstaddr IN ( SELECT dstaddr FROM dstexterne ) GROUP BY srcaddr, dstaddr, dstport ORDER BY mbytes DESC LIMIT 50;
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)
   Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
   I/O Timings: read=5323746.860
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)
         Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
         I/O Timings: read=5323746.860
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
               I/O Timings: read=5323746.860
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                     I/O Timings: read=5323746.860
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                           I/O Timings: read=5323746.860
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 Buffers: shared hit=1116590559 read=15851133
                                 I/O Timings: read=5323746.860
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=1116588309 read=15851133
                                       I/O Timings: read=5323746.860
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)
                                             Buffers: shared hit=996551813 read=15851133
                                             I/O Timings: read=5323746.860
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)
                                                   Buffers: shared hit=1634 read=15851133
                                                   I/O Timings: read=5323746.860
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 59971474
                                                   Buffers: shared hit=996550152
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             Buffers: shared hit=120036496
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=120036496
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         Buffers: shared hit=120036496
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                               Buffers: shared hit=120036459
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)
                                                               Buffers: shared hit=10
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)
                                                                     Buffers: shared hit=10
 Planning time: 6.689 ms
 Execution time: 2764860.853 ms
(58 rows)

Regarding "Also using dstat I can see that iowait time is at about 25%": I don't think the server was doing anything else at the time. If it is important, I can repeat the benchmarks.
Thanks!
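For what it's worth, here is a quick sanity check on the numbers above (a sketch in Python; the figures are copied from the plan, and the default 8 kB PostgreSQL block size is assumed):

```python
# Back-of-the-envelope checks on the EXPLAIN (analyze, buffers) output above.

# The three temporary files logged for the sort add up exactly to the
# "external merge  Disk: 2721856kB" reported by the Sort node:
temp_files_bytes = 1073741824 + 1073741824 + 639696896
print(temp_files_bytes // 1024)            # 2721856 (kB)

# Average read throughput over the whole query, from the top-level node:
blocks_read = 15851133                     # "shared ... read=15851133"
exec_time_s = 2764.86                      # "Execution time: 2764860.853 ms"
mib_read = blocks_read * 8 / 1024          # total MiB read from disk
print(round(mib_read / exec_time_s, 1))    # 44.8 MiB/s
```

So the query reads roughly 121 GiB from disk at about 45 MiB/s sustained, on top of the buffer-cache hits.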
Charles

On Wed, Jul 12, 2017 at 2:39 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles.nad...@gmail.com>
> wrote:
>
>> Jeff,
>>
>> I used fio in a quick benchmarking script inspired by
>> https://smcleod.net/benchmarking-io/:
>>
>> #!/bin/bash
>> #Random throughput
>> echo "Random throughput"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4M --iodepth=256 --size=10G
>> --readwrite=randread --ramp_time=4
>> #Random IOPS
>> echo "Random IOPS"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4k --iodepth=256 --size=4G
>> --readwrite=randread --ramp_time=4
>> #Sequential throughput
>> echo "Sequential throughput"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4M --iodepth=256 --size=10G
>> --readwrite=read --ramp_time=4
>> #Sequential IOPS
>> echo "Sequential IOPS"
>> sync
>> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test --filename=test --bs=4k --iodepth=256 --size=4G
>> --readwrite=read --ramp_time=4
>>
>
> I don't think any of those are directly relevant to PostgreSQL, as it
> doesn't use direct IO, doesn't use libaio, and is rarely going to get
> anywhere near 256 iodepth.  So the best they can do is put a theoretical
> ceiling on the performance.  Also, random IO with a 4MB stride doesn't
> make any sense from a PostgreSQL perspective.
>
>> Performing the test you suggested, I get 128.5MB/s. Monitoring the test,
>> I find that the throughput is constant from start to finish and that the
>> iowait is also constant at 5%:
>
> I would have expected it to do better than that.  Maybe you should
> increase the kernel readahead setting.  I've found the default to be much
> too small.  But it doesn't make much difference to you, as you appear to
> be doing random IO in your queries, not sequential.
>
>> Could you suggest another way to benchmark random reads?
>>
>
> Your 1100 IOPS times an 8kB block size gives about 8MB/s of throughput,
> which is close to what you report.  So I think I'd instead focus on tuning
> your actual queries.  You say the problem is not query-dependent, but I
> think that just means all the queries you looked at are similar.  If you
> looked at a query that can't use indexes, like count(unindexed_column)
> from biggest_table, you would find it doing much more IO than 4MB/s.
>
> Can you pick the simplest query you actually care about, and post both an
> "explain (analyze, timing off)" and an "explain (analyze, buffers)" for
> it?  (Preferably turning "track_io_timing" on first.)
>
> One other question I had: you said you had "2x Intel Xeon E5550", which
> should be 8 CPUs (or 16, if the hyperthreads are reported as separate
> CPUs).  But you also said: "Also using dstat I can see that iowait time is
> at about 25%".  Usually, if there is only one thing going on on the
> server, IOWAIT won't be more than the reciprocal of the number of CPUs.
> Is the server busy doing other stuff at the same time you are
> benchmarking it?
>
> Cheers,
>
> Jeff

--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/