Here is one other thing I could learn from TPC-DS benchmark.

The attached query is Q4 of TPC-DS, and its result was towards SF=100.
It took long time to compete (about 30min), please see the attached
EXPLAIN ANALYZE output.

Its core workload is placed on CTE year_total. Its Append node has
underlying three HashAggregate nodes which also has tables join for
each.

Below shows the first HashAggregate node. It consumes 268M rows, then
generates 9M rows. Underlying GpuJoin takes 74sec to process 268M rows,
so we can guess HashAggregate consumed 400sec.

  HashAggregate  (cost=18194948.40..21477516.00 rows=262605408 width=178) 
(actual time=480652.856..488055.918 rows=9142442 loops=1)
    Group Key: customer.c_customer_id, customer.c_first_name, 
customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, 
customer.c_login, customer.c_email_address, date_dim.d_year
    ->  Custom Scan (GpuJoin)  (cost=101342.54..9660272.64 rows=262605408 
width=178) (actual time=2472.174..73908.894 rows=268562375 loops=1)
          Bulkload: On (density: 100.00%)
          Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: 
(ss_sold_date_sk = d_date_sk), nrows (287997024 -> 275041999, 95.50% expected 
95.47%)
          Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: 
(ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 93.25% 
expected 91.18%)
          ->  Custom Scan (BulkScan) on store_sales  (cost=0.00..9649559.60 
rows=287996960 width=38) (actual time=18.372..54522.803 rows=287997024 loops=1)
          ->  Seq Scan on date_dim  (cost=0.00..2705.49 rows=73049 width=16) 
(actual time=0.015..15.533 rows=73049 loops=1)
          ->  Seq Scan on customer  (cost=0.00..87141.74 rows=2000074 
width=156) (actual time=0.018..582.373 rows=2000000 loops=1)

Other two HashAggregate nodes have similar behavior. The second one
consumed 281sec, including 30sec by underlying GpuJoin. The third one
consumed 138sec, including 25sec by underlying GpuJoin.

Apart from my custom join implementation, It seems to me HashAggregate
node consumed too much time than expectation.

One characteristics of this workload is, this aggregation takes eight
grouping-keys. I doubt cost of function invocation for hash-value and
equal-checks may be criminal.


Let's dive into nodeAgg.c.
ExecAgg() calls agg_fill_hash_table() to fill up the hash table with
rows fetched from the underlying tables. On construction of the hash
table, it calls hash functions (at TupleHashTableHash) and equal check
functions (at execTuplesMatch) repeatedly.
Both of them uses FunctionCallX() interface that exchange the argument
using FmgrInfo structure. Usually, it is not the best way from performance
perspective. Especially, this workload takes 268M input rows and
8 grouping keys, so 268M (rows) x 8 (grouping keys) x 2 (for hash/equal),
4.2B times function calls via FmgrInfo shall happen.

I think SortSupport logic provides a reasonable way to solve this
kind of problem. For example, btint4sortsupport() informs a function
pointer of the fast version of comparator (btint4fastcmp) which takes
two Datum argument without indirect memory reference.
This mechanism will also make sense for HashAggregate logic, to reduce
the cost of function invocations.

Please comment on the idea I noticed here.


And, as an aside, if HashAggregate picks up a hash-value of grouping-keys
from the target-list of underlying plan node (GpuJoin in this case), it
may be possible to off-load calculation of hash-values on GPU device. :-)

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Thursday, August 13, 2015 8:23 PM
> To: Greg Stark
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable 
> plan
> 
> > On Thu, Aug 13, 2015 at 2:49 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > > In fact, cost of HashJoin underlying Sort node is:
> > >     ->  Hash Join  (cost=621264.91..752685.48 rows=1 width=132)
> > >
> > > On the other hands, NestedLoop on same place is:
> > >     ->  Nested Loop  (cost=0.00..752732.26 rows=1 width=132)
> > >
> > > Probably, small GUC adjustment may make optimizer to prefer HashJoin 
> > > towards
> > > these kind of queries.
> >
> > With that kind of discrepancy I doubt adjusting GUCs will be sufficient
> >
> > > Do you have a good idea?
> >
> > Do you have EXPLAIN ANALYZE from the plan that finishes? Are there any
> > row estimates that are way off?
> >
> Yes, EXPLAIN ANALYZE is attached.
> 
> According to this, CTE year_total generates 384,208 rows. It is much smaller
> than estimation (4.78M rows), however, filter's selectivity of CTE Scan was
> not large as expectation.
> For example, the deepest CTE Scan returns 37923 rows and 26314 rows, even 
> though
> 40 rows were expected. On the next level, relations join between 11324 rows 
> and
> 9952 rows, towards to estimation of 40rows x 8 rows.
> If NestLoop is placed instead of HashJoin, it will make an explosion of the 
> number
> of loops.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kai...@ak.jp.nec.com>

EXPLAIN ANALYZE
with year_total as (
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
       
,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
 year_total
       ,'s' sale_type
 from customer
     ,store_sales
     ,date_dim
 where c_customer_sk = ss_customer_sk
   and ss_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
 union all
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
       
,sum((((cs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
 ) year_total
       ,'c' sale_type
 from customer
     ,catalog_sales
     ,date_dim
 where c_customer_sk = cs_bill_customer_sk
   and cs_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
union all
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
       
,sum((((ws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
 ) year_total
       ,'w' sale_type
 from customer
     ,web_sales
     ,date_dim
 where c_customer_sk = ws_bill_customer_sk
   and ws_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
         )
  select
                  t_s_secyear.customer_id
                 ,t_s_secyear.customer_first_name
                 ,t_s_secyear.customer_last_name
                 ,t_s_secyear.customer_email_address
 from year_total t_s_firstyear
     ,year_total t_s_secyear
     ,year_total t_c_firstyear
     ,year_total t_c_secyear
     ,year_total t_w_firstyear
     ,year_total t_w_secyear
 where t_s_secyear.customer_id = t_s_firstyear.customer_id
   and t_s_firstyear.customer_id = t_c_secyear.customer_id
   and t_s_firstyear.customer_id = t_c_firstyear.customer_id
   and t_s_firstyear.customer_id = t_w_firstyear.customer_id
   and t_s_firstyear.customer_id = t_w_secyear.customer_id
   and t_s_firstyear.sale_type = 's'
   and t_c_firstyear.sale_type = 'c'
   and t_w_firstyear.sale_type = 'w'
   and t_s_secyear.sale_type = 's'
   and t_c_secyear.sale_type = 'c'
   and t_w_secyear.sale_type = 'w'
   and t_s_firstyear.dyear =  2001
   and t_s_secyear.dyear = 2001+1
   and t_c_firstyear.dyear =  2001
   and t_c_secyear.dyear =  2001+1
   and t_w_firstyear.dyear = 2001
   and t_w_secyear.dyear = 2001+1
   and t_s_firstyear.year_total > 0
   and t_c_firstyear.year_total > 0
   and t_w_firstyear.year_total > 0
   and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / 
t_c_firstyear.year_total else null end
           > case when t_s_firstyear.year_total > 0 then t_s_secyear.year_total 
/ t_s_firstyear.year_total else null end
   and case when t_c_firstyear.year_total > 0 then t_c_secyear.year_total / 
t_c_firstyear.year_total else null end
           > case when t_w_firstyear.year_total > 0 then t_w_secyear.year_total 
/ t_w_firstyear.year_total else null end
 order by t_s_secyear.customer_id
         ,t_s_secyear.customer_first_name
         ,t_s_secyear.customer_last_name
         ,t_s_secyear.customer_email_address
limit 100;
                                                                                
                              QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2374127478.27..2374127478.52 rows=100 width=132) (actual 
time=1873138.288..1873138.300 rows=100 loops=1)
   CTE year_total
     ->  Append  (cost=18194948.40..46107899.67 rows=477116919 width=220) 
(actual time=480652.857..909366.179 rows=21696335 loops=1)
           ->  HashAggregate  (cost=18194948.40..21477516.00 rows=262605408 
width=178) (actual time=480652.856..488055.918 rows=9142442 loops=1)
                 Group Key: customer.c_customer_id, customer.c_first_name, 
customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, 
customer.c_login, customer.c_email_address, date_dim.d_year
                 ->  Custom Scan (GpuJoin)  (cost=101342.54..9660272.64 
rows=262605408 width=178) (actual time=2472.174..73908.894 rows=268562375 
loops=1)
                       Bulkload: On (density: 100.00%)
                       Depth 1: Logic: GpuHashJoin, HashKeys: 
(ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows (287997024 -> 
275041999, 95.50% expected 95.47%)
                       Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), 
JoinQual: (ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 
93.25% expected 91.18%)
                       ->  Custom Scan (BulkScan) on store_sales  
(cost=0.00..9649559.60 rows=287996960 width=38) (actual time=18.372..54522.803 
rows=287997024 loops=1)
                       ->  Seq Scan on date_dim  (cost=0.00..2705.49 rows=73049 
width=16) (actual time=0.015..15.533 rows=73049 loops=1)
                       ->  Seq Scan on customer  (cost=0.00..87141.74 
rows=2000074 width=156) (actual time=0.018..582.373 rows=2000000 loops=1)
           ->  HashAggregate  (cost=11391014.70..13172962.90 rows=142555856 
width=181) (actual time=274756.424..281330.219 rows=8037433 loops=1)
                 Group Key: customer_1.c_customer_id, customer_1.c_first_name, 
customer_1.c_last_name, customer_1.c_preferred_cust_flag, 
customer_1.c_birth_country, customer_1.c_login, customer_1.c_email_address, 
date_dim_1.d_year
                 ->  Custom Scan (GpuJoin)  (cost=101368.56..6757949.38 
rows=142555856 width=181) (actual time=1215.477..49357.736 rows=142917412 
loops=1)
                       Bulkload: On (density: 100.00%)
                       Depth 1: Logic: GpuHashJoin, HashKeys: 
(cs_bill_customer_sk), JoinQual: (c_customer_sk = cs_bill_customer_sk), nrows 
(143997065 -> 143277592, 99.50% expected 99.48%)
                       Depth 2: Logic: GpuHashJoin, HashKeys: 
(cs_sold_date_sk), JoinQual: (cs_sold_date_sk = d_date_sk), nrows (143277592 -> 
142917412, 99.25% expected 99.00%)
                       ->  Custom Scan (BulkScan) on catalog_sales  
(cost=0.00..6555775.08 rows=143997008 width=41) (actual time=15.789..39415.420 
rows=143997065 loops=1)
                       ->  Seq Scan on customer customer_1  
(cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.012..264.787 
rows=2000000 loops=1)
                       ->  Seq Scan on date_dim date_dim_1  (cost=0.00..2705.49 
rows=73049 width=16) (actual time=0.004..9.851 rows=73049 loops=1)
           ->  HashAggregate  (cost=5786805.89..6686251.57 rows=71955655 
width=181) (actual time=134861.430..138412.674 rows=4516460 loops=1)
                 Group Key: customer_2.c_customer_id, customer_2.c_first_name, 
customer_2.c_last_name, customer_2.c_preferred_cust_flag, 
customer_2.c_birth_country, customer_2.c_login, customer_2.c_email_address, 
date_dim_2.d_year
                 ->  Custom Scan (GpuJoin)  (cost=101379.26..3448247.10 
rows=71955655 width=181) (actual time=1071.093..25145.723 rows=71974446 loops=1)
                       Bulkload: On (density: 100.00%)
                       Depth 1: Logic: GpuHashJoin, HashKeys: 
(ws_bill_customer_sk), JoinQual: (c_customer_sk = ws_bill_customer_sk), nrows 
(72001237 -> 71983355, 99.98% expected 99.96%)
                       Depth 2: Logic: GpuHashJoin, HashKeys: 
(ws_sold_date_sk), JoinQual: (ws_sold_date_sk = d_date_sk), nrows (71983355 -> 
71974446, 99.96% expected 99.94%)
                       ->  Custom Scan (BulkScan) on web_sales  
(cost=0.00..3290612.48 rows=72001248 width=41) (actual time=11.205..19596.402 
rows=72001237 loops=1)
                       ->  Seq Scan on customer customer_2  
(cost=0.00..87141.74 rows=2000074 width=156) (actual time=0.012..264.418 
rows=2000000 loops=1)
                       ->  Seq Scan on date_dim date_dim_2  (cost=0.00..2705.49 
rows=73049 width=16) (actual time=0.004..9.406 rows=73049 loops=1)
   ->  Sort  (cost=2328019578.60..2420615155.73 rows=37038230850 width=132) 
(actual time=1873138.287..1873138.293 rows=100 loops=1)
         Sort Key: t_s_secyear.customer_id, t_s_secyear.customer_first_name, 
t_s_secyear.customer_last_name, t_s_secyear.customer_email_address
         Sort Method: top-N heapsort  Memory: 50kB
         ->  Custom Scan (GpuJoin)  (cost=62321910.10..912445027.90 
rows=37038230850 width=132) (actual time=1871500.013..1873137.441 rows=1812 
loops=1)
               Filter: (CASE WHEN (year_total > '0'::numeric) THEN (year_total 
/ year_total) ELSE NULL::numeric END > CASE WHEN (year_total > '0'::numeric) 
THEN (year_total / year_total) ELSE NULL::numer
ic END)
               Rows Removed by Filter: 768
               Bulkload: Off
               Depth 1: Logic: GpuHashJoin, HashKeys: (customer_id), JoinQual: 
(customer_id = customer_id), nrows (12889 -> 5689, 44.14% expected 1988.00%)
               Depth 2: Logic: GpuHashJoin, HashKeys: (customer_id), JoinQual: 
(customer_id = customer_id), nrows (5689 -> 2580, 20.02% expected 39521.44%)
               ->  Custom Scan (GpuJoin)  (cost=36984981.70..52218645.48 
rows=93716805 width=256) (actual time=20512.681..943222.290 rows=12889 loops=1)
                     Filter: (CASE WHEN (year_total > '0'::numeric) THEN 
(year_total / year_total) ELSE NULL::numeric END > CASE WHEN (year_total > 
'0'::numeric) THEN (year_total / year_total) ELSE NULL::numeric END)
                     Rows Removed by Filter: 13531
                     Bulkload: Off
                     Depth 1: Logic: GpuHashJoin, HashKeys: (customer_id), 
JoinQual: (customer_id = customer_id), nrows (1816438 -> 42439, 2.34% expected 
100.00%)
                     Depth 2: Logic: GpuHashJoin, HashKeys: (customer_id), 
JoinQual: (customer_id = customer_id), nrows (42439 -> 26420, 1.45% expected 
2357062.50%)
                     ->  CTE Scan on year_total t_s_firstyear  
(cost=0.00..13120715.27 rows=3976 width=52) (actual time=0.020..5425.980 
rows=1816438 loops=1)
                           Filter: ((year_total > '0'::numeric) AND (sale_type 
= 's'::text) AND (dyear = 2001))
                           Rows Removed by Filter: 19879897
                     ->  CTE Scan on year_total t_s_secyear  
(cost=0.00..11927922.98 rows=11928 width=164) (actual time=0.007..45.249 
rows=46636 loops=1)
                           Filter: ((sale_type = 's'::text) AND (dyear = 2002))
                           Rows Removed by Filter: 185596
                     ->  Merge Join  (cost=25049683.60..25053260.42 rows=237129 
width=104) (actual time=14017.124..15376.421 rows=1243815 loops=1)
                           Merge Cond: (t_c_firstyear.customer_id = 
t_c_secyear.customer_id)
                           ->  Sort  (cost=13120952.98..13120962.92 rows=3976 
width=52) (actual time=7764.765..8009.308 rows=1561656 loops=1)
                                 Sort Key: t_c_firstyear.customer_id
                                 Sort Method: quicksort  Memory: 171157kB
                                 ->  CTE Scan on year_total t_c_firstyear  
(cost=0.00..13120715.27 rows=3976 width=52) (actual time=2116.763..5383.853 
rows=1561657 loops=1)
                                       Filter: ((year_total > '0'::numeric) AND 
(sale_type = 'c'::text) AND (dyear = 2001))
                                       Rows Removed by Filter: 20134678
                           ->  Sort  (cost=11928730.62..11928760.44 rows=11928 
width=52) (actual time=6252.341..6503.035 rows=1593426 loops=1)
                                 Sort Key: t_c_secyear.customer_id
                                 Sort Method: quicksort  Memory: 173639kB
                                 ->  CTE Scan on year_total t_c_secyear  
(cost=0.00..11927922.98 rows=11928 width=52) (actual time=1458.334..3811.236 
rows=1593426 loops=1)
                                       Filter: ((sale_type = 'c'::text) AND 
(dyear = 2002))
                                       Rows Removed by Filter: 20102909
               ->  CTE Scan on year_total t_w_firstyear  
(cost=0.00..13120715.27 rows=3976 width=52) (actual time=917023.470..924167.714 
rows=886499 loops=1)
                     Filter: ((year_total > '0'::numeric) AND (sale_type = 
'w'::text) AND (dyear = 2001))
                     Rows Removed by Filter: 20809836
               ->  CTE Scan on year_total t_w_secyear  (cost=0.00..11927922.98 
rows=11928 width=52) (actual time=2748.997..3629.952 rows=902694 loops=1)
                     Filter: ((sale_type = 'w'::text) AND (dyear = 2002))
                     Rows Removed by Filter: 20793641
 Planning time: 7.351 ms
 Execution time: 1877397.375 ms
(73 rows)
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to