Here is one other thing I could learn from TPC-DS benchmark. The attached query is Q4 of TPC-DS, and its result was towards SF=100. It took long time to compete (about 30min), please see the attached EXPLAIN ANALYZE output.
Its core workload is placed on CTE year_total. Its Append node has underlying three HashAggregate nodes which also has tables join for each. Below shows the first HashAggregate node. It consumes 268M rows, then generates 9M rows. Underlying GpuJoin takes 74sec to process 268M rows, so we can guess HashAggregate consumed 400sec. HashAggregate (cost=18194948.40..21477516.00 rows=262605408 width=178) (actual time=480652.856..488055.918 rows=9142442 loops=1) Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year -> Custom Scan (GpuJoin) (cost=101342.54..9660272.64 rows=262605408 width=178) (actual time=2472.174..73908.894 rows=268562375 loops=1) Bulkload: On (density: 100.00%) Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows (287997024 -> 275041999, 95.50% expected 95.47%) Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 93.25% expected 91.18%) -> Custom Scan (BulkScan) on store_sales (cost=0.00..9649559.60 rows=287996960 width=38) (actual time=18.372..54522.803 rows=287997024 loops=1) -> Seq Scan on date_dim (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.015..15.533 rows=73049 loops=1) -> Seq Scan on customer (cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.018..582.373 rows=2000000 loops=1) Other two HashAggregate nodes have similar behavior. The second one consumed 281sec, including 30sec by underlying GpuJoin. The third one consumed 138sec, including 25sec by underlying GpuJoin. Apart from my custom join implementation, It seems to me HashAggregate node consumed too much time than expectation. One characteristics of this workload is, this aggregation takes eight grouping-keys. I doubt cost of function invocation for hash-value and equal-checks may be criminal. Let's dive into nodeAgg.c. ExecAgg() calls agg_fill_hash_table() to fill up the hash table with rows fetched from the underlying tables. On construction of the hash table, it calls hash functions (at TupleHashTableHash) and equal check functions (at execTuplesMatch) repeatedly. Both of them uses FunctionCallX() interface that exchange the argument using FmgrInfo structure. Usually, it is not the best way from performance perspective. Especially, this workload takes 268M input rows and 8 grouping keys, so 268M (rows) x 8 (grouping keys) x 2 (for hash/equal), 4.2B times function calls via FmgrInfo shall happen. I think SortSupport logic provides a reasonable way to solve this kind of problem. For example, btint4sortsupport() informs a function pointer of the fast version of comparator (btint4fastcmp) which takes two Datum argument without indirect memory reference. This mechanism will also make sense for HashAggregate logic, to reduce the cost of function invocations. Please comment on the idea I noticed here. And, as an aside, if HashAggregate picks up a hash-value of grouping-keys from the target-list of underlying plan node (GpuJoin in this case), it may be possible to off-load calculation of hash-values on GPU device. :-) Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kai...@ak.jp.nec.com> > -----Original Message----- > From: pgsql-hackers-ow...@postgresql.org > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai > Sent: Thursday, August 13, 2015 8:23 PM > To: Greg Stark > Cc: PostgreSQL-development > Subject: Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable > plan > > > On Thu, Aug 13, 2015 at 2:49 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote: > > > In fact, cost of HashJoin underlying Sort node is: > > > -> Hash Join (cost=621264.91..752685.48 rows=1 width=132) > > > > > > On the other hands, NestedLoop on same place is: > > > -> Nested Loop (cost=0.00..752732.26 rows=1 width=132) > > > > > > Probably, small GUC adjustment may make optimizer to prefer HashJoin > > > towards > > > these kind of queries. > > > > With that kind of discrepancy I doubt adjusting GUCs will be sufficient > > > > > Do you have a good idea? > > > > Do you have EXPLAIN ANALYZE from the plan that finishes? Are there any > > row estimates that are way off? > > > Yes, EXPLAIN ANALYZE is attached. > > According to this, CTE year_total generates 384,208 rows. It is much smaller > than estimation (4.78M rows), however, filter's selectivity of CTE Scan was > not large as expectation. > For example, the deepest CTE Scan returns 37923 rows and 26314 rows, even > though > 40 rows were expected. On the next level, relations join between 11324 rows > and > 9952 rows, towards to estimation of 40rows x 8 rows. > If NestLoop is placed instead of HashJoin, it will make an explosion of the > number > of loops. > > Thanks, > -- > NEC Business Creation Division / PG-Strom Project > KaiGai Kohei <kai...@ak.jp.nec.com>
EXPLAIN ANALYZE with year_total as ( select c_customer_id customer_id ,c_first_name customer_first_name ,c_last_name customer_last_name ,c_preferred_cust_flag customer_preferred_cust_flag ,c_birth_country customer_birth_country ,c_login customer_login ,c_email_address customer_email_address ,d_year dyear ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2) year_total ,'s' sale_type from customer ,store_sales ,date_dim where c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk group by c_customer_id ,c_first_name ,c_last_name ,c_preferred_cust_flag ,c_birth_country ,c_login ,c_email_address ,d_year union all select c_customer_id customer_id ,c_first_name customer_first_name ,c_last_name customer_last_name ,c_preferred_cust_flag customer_preferred_cust_flag ,c_birth_country customer_birth_country ,c_login customer_login ,c_email_address customer_email_address ,d_year dyear ,sum((((cs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2) ) year_total ,'c' sale_type from customer ,catalog_sales ,date_dim where c_customer_sk = cs_bill_customer_sk and cs_sold_date_sk = d_date_sk group by c_customer_id ,c_first_name ,c_last_name ,c_preferred_cust_flag ,c_birth_country ,c_login ,c_email_address ,d_year union all select c_customer_id customer_id ,c_first_name customer_first_name ,c_last_name customer_last_name ,c_preferred_cust_flag customer_preferred_cust_flag ,c_birth_country customer_birth_country ,c_login customer_login ,c_email_address customer_email_address ,d_year dyear ,sum((((ws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2) ) year_total ,'w' sale_type from customer ,web_sales ,date_dim where c_customer_sk = ws_bill_customer_sk and ws_sold_date_sk = d_date_sk group by c_customer_id ,c_first_name ,c_last_name ,c_preferred_cust_flag ,c_birth_country ,c_login ,c_email_address ,d_year ) select t_s_secyear.customer_id ,t_s_secyear.customer_first_name ,t_s_secyear.customer_last_name ,t_s_secyear.customer_email_address from year_total t_s_firstyear ,year_total t_s_secyear ,year_total t_c_firstyear ,year_total t_c_secyear ,year_total t_w_firstyear ,year_total t_w_secyear where t_s_secyear.customer_id = t_s_firstyear.customer_id and t_s_firstyear.customer_id = t_c_secyear.customer_id and t_s_firstyear.customer_id = t_c_firstyear.customer_id and t_s_firstyear.customer_id = t_w_firstyear.customer_id and t_s_firstyear.customer_id = t_w_secyear.customer_id and t_s_firstyear.sale_type = 's' and t_c_firstyear.sale_type = 'c' and t_w_firstyear.sale_type = 'w' and t_s_secyear.sale_type = 's' and t_c_secyear.sale_type = 'c' and t_w_secyear.sale_type = 'w' and t_s_firstyear.dyear = 2001 and t_s_secyear.dyear = 2001+1 and t_c_firstyear.dyear = 2001 and t_c_secyear.dyear = 2001+1 and t_w_firstyear.dyear = 2001 and t_w_secyear.dyear = 2001+1 and t_s_firstyear.year_total > 0 and t_c_firstyear.year_total > 0 and t_w_firstyear.year_total > 0 and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / t_c_firstyear.year_total else null end > case when t_s_firstyear.year_total > 0 then t_s_secyear.year_total / t_s_firstyear.year_total else null end and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / t_c_firstyear.year_total else null end > case when t_w_firstyear.year_total > 0 then t_w_secyear.year_total / t_w_firstyear.year_total else null end order by t_s_secyear.customer_id ,t_s_secyear.customer_first_name ,t_s_secyear.customer_last_name ,t_s_secyear.customer_email_address limit 100; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=2374127478.27..2374127478.52 rows=100 width=132) (actual time=1873138.288..1873138.300 rows=100 loops=1) CTE year_total -> Append (cost=18194948.40..46107899.67 rows=477116919 width=220) (actual time=480652.857..909366.179 rows=21696335 loops=1) -> HashAggregate (cost=18194948.40..21477516.00 rows=262605408 width=178) (actual time=480652.856..488055.918 rows=9142442 loops=1) Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year -> Custom Scan (GpuJoin) (cost=101342.54..9660272.64 rows=262605408 width=178) (actual time=2472.174..73908.894 rows=268562375 loops=1) Bulkload: On (density: 100.00%) Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows (287997024 -> 275041999, 95.50% expected 95.47%) Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 93.25% expected 91.18%) -> Custom Scan (BulkScan) on store_sales (cost=0.00..9649559.60 rows=287996960 width=38) (actual time=18.372..54522.803 rows=287997024 loops=1) -> Seq Scan on date_dim (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.015..15.533 rows=73049 loops=1) -> Seq Scan on customer (cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.018..582.373 rows=2000000 loops=1) -> HashAggregate (cost=11391014.70..13172962.90 rows=142555856 width=181) (actual time=274756.424..281330.219 rows=8037433 loops=1) Group Key: customer_1.c_customer_id, customer_1.c_first_name, customer_1.c_last_name, customer_1.c_preferred_cust_flag, customer_1.c_birth_country, customer_1.c_login, customer_1.c_email_address, date_dim_1.d_year -> Custom Scan (GpuJoin) (cost=101368.56..6757949.38 rows=142555856 width=181) (actual time=1215.477..49357.736 rows=142917412 loops=1) Bulkload: On (density: 100.00%) Depth 1: Logic: GpuHashJoin, HashKeys: (cs_bill_customer_sk), JoinQual: (c_customer_sk = cs_bill_customer_sk), nrows (143997065 -> 143277592, 99.50% expected 99.48%) Depth 2: Logic: GpuHashJoin, HashKeys: (cs_sold_date_sk), JoinQual: (cs_sold_date_sk = d_date_sk), nrows (143277592 -> 142917412, 99.25% expected 99.00%) -> Custom Scan (BulkScan) on catalog_sales (cost=0.00..6555775.08 rows=143997008 width=41) (actual time=15.789..39415.420 rows=143997065 loops=1) -> Seq Scan on customer customer_1 (cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.012..264.787 rows=2000000 loops=1) -> Seq Scan on date_dim date_dim_1 (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.004..9.851 rows=73049 loops=1) -> HashAggregate (cost=5786805.89..6686251.57 rows=71955655 width=181) (actual time=134861.430..138412.674 rows=4516460 loops=1) Group Key: customer_2.c_customer_id, customer_2.c_first_name, customer_2.c_last_name, customer_2.c_preferred_cust_flag, customer_2.c_birth_country, customer_2.c_login, customer_2.c_email_address, date_dim_2.d_year -> Custom Scan (GpuJoin) (cost=101379.26..3448247.10 rows=71955655 width=181) (actual time=1071.093..25145.723 rows=71974446 loops=1) Bulkload: On (density: 100.00%) Depth 1: Logic: GpuHashJoin, HashKeys: (ws_bill_customer_sk), JoinQual: (c_customer_sk = ws_bill_customer_sk), nrows (72001237 -> 71983355, 99.98% expected 99.96%) Depth 2: Logic: GpuHashJoin, HashKeys: (ws_sold_date_sk), JoinQual: (ws_sold_date_sk = d_date_sk), nrows (71983355 -> 71974446, 99.96% expected 99.94%) -> Custom Scan (BulkScan) on web_sales (cost=0.00..3290612.48 rows=72001248 width=41) (actual time=11.205..19596.402 rows=72001237 loops=1) -> Seq Scan on customer customer_2 (cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.012..264.418 rows=2000000 loops=1) -> Seq Scan on date_dim date_dim_2 (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.004..9.406 rows=73049 loops=1) -> Sort (cost=2328019578.60..2420615155.73 rows=37038230850 width=132) (actual time=1873138.287..1873138.293 rows=100 loops=1) Sort Key: t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_email_address Sort Method: top-N heapsort Memory: 50kB -> Custom Scan (GpuJoin) (cost=62321910.10..912445027.90 rows=37038230850 width=132) (actual time=1871500.013..1873137.441 rows=1812 loops=1) Filter: (CASE WHEN (year_total > '0'::numeric) THEN (year_total / year_total) ELSE NULL::numeric END > CASE WHEN (year_total > '0'::numeric) THEN (year_total / year_total) ELSE NULL::numer ic END) Rows Removed by Filter: 768 Bulkload: Off Depth 1: Logic: GpuHashJoin, HashKeys: (customer_id), JoinQual: (customer_id = customer_id), nrows (12889 -> 5689, 44.14% expected 1988.00%) Depth 2: Logic: GpuHashJoin, HashKeys: (customer_id), JoinQual: (customer_id = customer_id), nrows (5689 -> 2580, 20.02% expected 39521.44%) -> Custom Scan (GpuJoin) (cost=36984981.70..52218645.48 rows=93716805 width=256) (actual time=20512.681..943222.290 rows=12889 loops=1) Filter: (CASE WHEN (year_total > '0'::numeric) THEN (year_total / year_total) ELSE NULL::numeric END > CASE WHEN (year_total > '0'::numeric) THEN (year_total / year_total) ELSE NULL::numeric END) Rows Removed by Filter: 13531 Bulkload: Off Depth 1: Logic: GpuHashJoin, HashKeys: (customer_id), JoinQual: (customer_id = customer_id), nrows (1816438 -> 42439, 2.34% expected 100.00%) Depth 2: Logic: GpuHashJoin, HashKeys: (customer_id), JoinQual: (customer_id = customer_id), nrows (42439 -> 26420, 1.45% expected 2357062.50%) -> CTE Scan on year_total t_s_firstyear (cost=0.00..13120715.27 rows=3976 width=52) (actual time=0.020..5425.980 rows=1816438 loops=1) Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001)) Rows Removed by Filter: 19879897 -> CTE Scan on year_total t_s_secyear (cost=0.00..11927922.98 rows=11928 width=164) (actual time=0.007..45.249 rows=46636 loops=1) Filter: ((sale_type = 's'::text) AND (dyear = 2002)) Rows Removed by Filter: 185596 -> Merge Join (cost=25049683.60..25053260.42 rows=237129 width=104) (actual time=14017.124..15376.421 rows=1243815 loops=1) Merge Cond: (t_c_firstyear.customer_id = t_c_secyear.customer_id) -> Sort (cost=13120952.98..13120962.92 rows=3976 width=52) (actual time=7764.765..8009.308 rows=1561656 loops=1) Sort Key: t_c_firstyear.customer_id Sort Method: quicksort Memory: 171157kB -> CTE Scan on year_total t_c_firstyear (cost=0.00..13120715.27 rows=3976 width=52) (actual time=2116.763..5383.853 rows=1561657 loops=1) Filter: ((year_total > '0'::numeric) AND (sale_type = 'c'::text) AND (dyear = 2001)) Rows Removed by Filter: 20134678 -> Sort (cost=11928730.62..11928760.44 rows=11928 width=52) (actual time=6252.341..6503.035 rows=1593426 loops=1) Sort Key: t_c_secyear.customer_id Sort Method: quicksort Memory: 173639kB -> CTE Scan on year_total t_c_secyear (cost=0.00..11927922.98 rows=11928 width=52) (actual time=1458.334..3811.236 rows=1593426 loops=1) Filter: ((sale_type = 'c'::text) AND (dyear = 2002)) Rows Removed by Filter: 20102909 -> CTE Scan on year_total t_w_firstyear (cost=0.00..13120715.27 rows=3976 width=52) (actual time=917023.470..924167.714 rows=886499 loops=1) Filter: ((year_total > '0'::numeric) AND (sale_type = 'w'::text) AND (dyear = 2001)) Rows Removed by Filter: 20809836 -> CTE Scan on year_total t_w_secyear (cost=0.00..11927922.98 rows=11928 width=52) (actual time=2748.997..3629.952 rows=902694 loops=1) Filter: ((sale_type = 'w'::text) AND (dyear = 2002)) Rows Removed by Filter: 20793641 Planning time: 7.351 ms Execution time: 1877397.375 ms (73 rows)
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers