Igor, Initially temp_buffer was left to its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary files space. Should I set temp_buffer to 21GB? Here is the explain you requested with work_mem set to 6GB:
flows=# set work_mem='6GB'; SET flows=# explain (analyze, buffers) SELECT DISTINCT srcaddr, dstaddr, dstport, COUNT(*) AS conversation, SUM(doctets) / 1024 / 1024 AS mbytes FROM flowscompact, mynetworks WHERE mynetworks.ipaddr >>= flowscompact.srcaddr AND dstaddr IN ( SELECT dstaddr FROM dstexterne ) GROUP BY srcaddr, dstaddr, dstport ORDER BY mbytes DESC LIMIT 50; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1) Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154 -> Unique (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1) Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154 -> Sort (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1) Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*)) Sort Method: quicksort Memory: 654395kB Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154 -> GroupAggregate (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1) Group Key: flows.srcaddr, flows.dstaddr, flows.dstport Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154 -> Sort (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1) Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport Sort Method: external merge Disk: 3049216kB Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154 -> Gather (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1) Workers Planned: 12 Workers Launched: 12 Buffers: shared hit=728798037 read=82974833 -> Hash Semi Join (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13) Hash Cond: (flows.dstaddr = flows_1.dstaddr) Buffers: shared hit=728795193 read=82974833 -> Nested Loop (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13) Buffers: shared hit=590692229 read=14991777 -> Parallel Seq Scan on flows (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13) Buffers: shared hit=860990 read=14991777 -> Index Only Scan using mynetworks_ipaddr_idx on mynetworks (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190) Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r) Heap Fetches: 0 Buffers: shared hit=589831203 -> Hash (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13) Buckets: 4194304 (originally 16384) Batches: 1 (originally 1) Memory Usage: 166486kB Buffers: shared hit=138102964 read=67983056 -> HashAggregate (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13) Group Key: flows_1.dstaddr Buffers: shared hit=138102964 read=67983056 -> Nested Loop Anti Join (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13) Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r) Rows Removed by Join Filter: 503353617 Buffers: shared hit=138102964 read=67983056 -> Seq Scan on flows flows_1 (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13) Buffers: shared hit=138102915 read=67983056 -> Materialize (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470) Buffers: shared hit=13 -> Seq Scan on mynetworks mynetworks_1 (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13) Buffers: shared hit=13 Planning time: 0.941 ms Execution time: 2228345.171 ms (48 rows) With a work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the i/o wait was much lower, hovering aroun 3% then it jumped 45% until almost the end of the query. flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are use often in other queries. flowscompact is a view built on table flows (having about 590 million rows), it only keeps the most often used fields. flows=# \d+ flowscompact; View "public.flowscompact" Column | Type | Modifiers | Storage | Description -----------+--------------------------+-----------+---------+------------- flow_id | bigint | | plain | sysuptime | bigint | | plain | exaddr | ip4 | | plain | dpkts | integer | | plain | doctets | bigint | | plain | first | bigint | | plain | last | bigint | | plain | srcaddr | ip4 | | plain | dstaddr | ip4 | | plain | srcport | integer | | plain | dstport | integer | | plain | prot | smallint | | plain | tos | smallint | | plain | tcp_flags | smallint | | plain | timestamp | timestamp with time zone | | plain | View definition: SELECT flowstimestamp.flow_id, flowstimestamp.sysuptime, flowstimestamp.exaddr, flowstimestamp.dpkts, flowstimestamp.doctets, flowstimestamp.first, flowstimestamp.last, flowstimestamp.srcaddr, flowstimestamp.dstaddr, flowstimestamp.srcport, flowstimestamp.dstport, flowstimestamp.prot, flowstimestamp.tos, flowstimestamp.tcp_flags, flowstimestamp."timestamp" FROM flowstimestamp; mynetworks is a table having one column and 4 rows; it contains a list of our network networks: flows=# select * from mynetworks; ipaddr ---------------- 192.168.0.0/24 10.112.12.0/30 10.112.12.4/30 10.112.12.8/30 (4 row) flows=# \d+ mynetworks; Table "public.mynetworks" Column | Type | Modifiers | Storage | Stats target | Description --------+------+-----------+---------+--------------+------------- ipaddr | ip4r | | plain | | Indexes: "mynetworks_ipaddr_idx" gist (ipaddr) dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows. flows=# \d+ dstexterne; View "public.dstexterne" Column | Type | Modifiers | Storage | Description ---------+------+-----------+---------+------------- dstaddr | ip4 | | plain | View definition: SELECT DISTINCT flowscompact.dstaddr FROM flowscompact LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r WHERE mynetworks.ipaddr IS NULL; Thanks! Charles On Wed, Jul 12, 2017 at 6:39 PM, Igor Neyman <iney...@perceptron.com> wrote: > > > > > *From:* pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance- > ow...@postgresql.org <pgsql-performance-ow...@postgresql.org>] *On Behalf > Of *Charles Nadeau > *Sent:* Wednesday, July 12, 2017 6:05 AM > *To:* Jeff Janes <jeff.ja...@gmail.com> > *Cc:* pgsql-performance@postgresql.org > *Subject:* Re: [PERFORM] Very poor read performance, query independent > > > > > > flows=# explain (analyze, buffers) SELECT DISTINCT > > flows-# srcaddr, > > flows-# dstaddr, > > flows-# dstport, > > flows-# COUNT(*) AS conversation, > > flows-# SUM(doctets) / 1024 / 1024 AS mbytes > > flows-# FROM > > flows-# flowscompact, > > flows-# mynetworks > > flows-# WHERE > > flows-# mynetworks.ipaddr >>= flowscompact.srcaddr > > flows-# AND dstaddr IN > > flows-# ( > > flows(# SELECT > > flows(# dstaddr > > flows(# FROM > > flows(# dstexterne > > flows(# ) > > flows-# GROUP BY > > flows-# srcaddr, > > flows-# dstaddr, > > flows-# dstport > > flows-# ORDER BY > > flows-# mbytes DESC LIMIT 50; > > LOG: temporary file: path "pg_tblspc/36238/PG_9.6_ > 201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824 > > LOG: temporary file: path "pg_tblspc/36238/PG_9.6_ > 201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824 > > LOG: temporary file: path "pg_tblspc/36238/PG_9.6_ > 201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896 > > LOG: duration: 2765020.327 ms statement: explain (analyze, buffers) > SELECT DISTINCT > > srcaddr, > > dstaddr, > > dstport, > > COUNT(*) AS conversation, > > SUM(doctets) / 1024 / 1024 AS mbytes > > FROM > > flowscompact, > > mynetworks > > WHERE > > mynetworks.ipaddr >>= flowscompact.srcaddr > > AND dstaddr IN > > ( > > SELECT > > dstaddr > > FROM > > dstexterne > > ) > > GROUP BY > > srcaddr, > > dstaddr, > > dstport > > ORDER BY > > mbytes DESC LIMIT 50; > > > QUERY PLAN > > > > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------------------------------------------------ > -------------------------------------------------- > > Limit (cost=37762321.83..37762321.98 rows=50 width=52) (actual > time=2764548.863..2764548.891 rows=50 loops=1) > > Buffers: shared hit=1116590560 read=15851133, temp read=340244 > written=340244 > > I/O Timings: read=5323746.860 > > -> Unique (cost=37762321.83..37769053.57 rows=2243913 width=52) > (actual time=2764548.861..2764548.882 rows=50 loops=1) > > Buffers: shared hit=1116590560 read=15851133, temp read=340244 > written=340244 > > I/O Timings: read=5323746.860 > > -> Sort (cost=37762321.83..37763443.79 rows=2243913 width=52) > (actual time=2764548.859..2764548.872 rows=50 loops=1) > > Sort Key: (((sum(flows.doctets) / '1024'::numeric) / > '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, > (count(*)) > > Sort Method: quicksort Memory: 563150kB > > Buffers: shared hit=1116590560 read=15851133, temp > read=340244 written=340244 > > I/O Timings: read=5323746.860 > > -> GroupAggregate (cost=37698151.34..37714980.68 > rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 > loops=1) > > Group Key: flows.srcaddr, flows.dstaddr, flows.dstport > > Buffers: shared hit=1116590560 read=15851133, temp > read=340244 written=340244 > > I/O Timings: read=5323746.860 > > -> Sort (cost=37698151.34..37699273.29 > rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 > loops=1) > > Sort Key: flows.srcaddr, flows.dstaddr, > flows.dstport > > Sort Method: external merge Disk: 2721856kB > > Buffers: shared hit=1116590560 read=15851133, > temp read=340244 written=340244 > > I/O Timings: read=5323746.860 > > -> Gather (cost=19463936.00..37650810.19 > rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 > loops=1) > > Workers Planned: 9 > > Workers Launched: 9 > > Buffers: shared hit=1116590559 > read=15851133 > > I/O Timings: read=5323746.860 > > -> Hash Semi Join > (cost=19462936.00..37622883.23 rows=249324 width=20) (actual > time=1847579.360..2602039.780 rows=8189699 loops=10) > > Hash Cond: (flows.dstaddr = > flows_1.dstaddr) > > Buffers: shared hit=1116588309 > read=15851133 > > I/O Timings: read=5323746.860 > > -> Nested Loop > (cost=0.03..18159012.30 rows=249324 width=20) (actual > time=1.562..736556.583 rows=45499045 loops=10) > > Buffers: shared hit=996551813 > read=15851133 > > I/O Timings: read=5323746.860 > > -> Parallel Seq Scan on > flows (cost=0.00..16039759.79 rows=62330930 width=20) (actual > time=1.506..547485.066 rows=54155970 loops=10) > > Buffers: shared > hit=1634 read=15851133 > > I/O Timings: > read=5323746.860 > > -> Index Only Scan using > mynetworks_ipaddr_idx on mynetworks (cost=0.03..0.03 rows=1 width=8) > (actual time=0.002..0.002 rows=1 loops=541559704) > > Index Cond: (ipaddr >>= > (flows.srcaddr)::ip4r) > > Heap Fetches: 59971474 > > Buffers: shared > hit=996550152 > > -> Hash > (cost=19462896.74..19462896.74 rows=11210 width=4) (actual > time=1847228.894..1847228.894 rows=3099798 loops=10) > > Buckets: 4194304 (originally > 16384) Batches: 1 (originally 1) Memory Usage: 141746kB > > Buffers: shared hit=120036496 > > -> HashAggregate > (cost=19462829.48..19462863.11 rows=11210 width=4) (actual > time=1230049.015..1845955.764 rows=3099798 loops=10) > > Group Key: > flows_1.dstaddr > > Buffers: shared > hit=120036496 > > -> Nested Loop Anti > Join (cost=0.12..19182620.78 rows=560417390 width=4) (actual > time=0.084..831832.333 rows=113420172 loops=10) > > Join Filter: > (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r) > > Rows Removed by > Join Filter: 453681377 > > Buffers: shared > hit=120036496 > > -> Index Only > Scan using flows_srcaddr_dstaddr_idx on flows flows_1 > (cost=0.12..9091067.70 rows=560978368 width=4) (actual > time=0.027..113052.437 rows=541559704 loops=10) > > Heap > Fetches: 91 > > Buffers: > shared hit=120036459 > > -> Materialize > (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 > loops=5415597040) > > Buffers: > shared hit=10 > > -> Seq > Scan on mynetworks mynetworks_1 (cost=0.00..1.01 rows=4 width=8) (actual > time=0.007..0.008 rows=4 loops=10) > > > Buffers: shared hit=10 > > Planning time: 6.689 ms > > Execution time: 2764860.853 ms > > (58 rows) > > > > Regarding "Also using dstat I can see that iowait time is at about 25%", I > don't think the server was doing anything else. If it is important, I can > repeat the benchmarks. > > Thanks! > > > > Charles > > > > Charles, > > > > In your original posting I couldn’t find what value you set for > temp_buffers. > > Considering you have plenty of RAM, try setting temp_buffers=’6GB’ and > then run ‘explain (analyze, buffers) select…’ in the same session. This > should alleviate “disk sort’ problem. > > > > Also, could you post the structure of flowscompact, mynetworks, and > dstextern tables with all the indexes and number of rows. Actually, are > they all – tables, or some of them – views? > > > > Igor > > > > > > Sorry, I misstated the parameter to change. > > It is work_mem (not temp_buffers) you should try to increase to 6GB. > > > > Igor > > > > > -- Charles Nadeau Ph.D. http://charlesnadeau.blogspot.com/