[jira] [Created] (IGNITE-23969) Sql. Improve predicate selectivity estimation

Konstantin Orlov (Jira) Thu, 12 Dec 2024 05:27:04 -0800

Konstantin Orlov created IGNITE-23969:
-----------------------------------------


             Summary: Sql. Improve predicate selectivity estimation
                 Key: IGNITE-23969
                 URL: https://issues.apache.org/jira/browse/IGNITE-23969
             Project: Ignite
          Issue Type: Improvement
          Components: sql
            Reporter: Konstantin Orlov


Let's take a look at query 48 plan from TPS-DS suite:
{code:java}
-- start query 1 in stream 0 using template query48.tpl and seed 622697896
select sum (ss_quantity)
 from store_sales, store, customer_demographics, customer_address, date_dim
 where s_store_sk = ss_store_sk
 and  ss_sold_date_sk = d_date_sk and d_year = 1998
 and  
 (
  (
   cd_demo_sk = ss_cdemo_sk
   and 
   cd_marital_status = 'M'
   and 
   cd_education_status = 'Unknown'
   and 
   ss_sales_price between 100.00 and 150.00  
   )
 or
  (
  cd_demo_sk = ss_cdemo_sk
   and 
   cd_marital_status = 'W'
   and 
   cd_education_status = 'College'
   and 
   ss_sales_price between 50.00 and 100.00   
  )
 or 
 (
  cd_demo_sk = ss_cdemo_sk
  and 
   cd_marital_status = 'D'
   and 
   cd_education_status = 'Primary'
   and 
   ss_sales_price between 150.00 and 200.00  
 )
 )
 and
 (
  (
  ss_addr_sk = ca_address_sk
  and
  ca_country = 'United States'
  and
  ca_state in ('MI', 'GA', 'NH')
  and ss_net_profit between 0 and 2000  
  )
 or
  (ss_addr_sk = ca_address_sk
  and
  ca_country = 'United States'
  and
  ca_state in ('TX', 'KY', 'SD')
  and ss_net_profit between 150 and 3000 
  )
 or
  (ss_addr_sk = ca_address_sk
  and
  ca_country = 'United States'
  and
  ca_state in ('NY', 'OH', 'FL')
  and ss_net_profit between 50 and 25000 
  )
 )
;

-- end query 1 in stream 0 using template query48.tpl

ColocatedHashAggregate
  Project
    NestedLoopJoin
      HashJoin
        Project(SS_SOLD_DATE_SK=[$30], SS_SOLD_TIME_SK=[$31], SS_ITEM_SK=[$32], 
SS_CUSTOMER_SK=[$33], SS_CDEMO_SK=[$34], SS_HDEMO_SK=[$35], SS_ADDR_SK=[$36], 
SS_STORE_SK=[$37], SS_PROMO_SK=[$38], SS_TICKET_NUMBER=[$39], 
SS_QUANTITY=[$40], SS_WHOLESALE_COST=[$41], SS_LIST_PRICE=[$42], 
SS_SALES_PRICE=[$43], SS_EXT_DISCOUNT_AMT=[$44], SS_EXT_SALES_PRICE=[$45], 
SS_EXT_WHOLESALE_COST=[$46], SS_EXT_LIST_PRICE=[$47], SS_EXT_TAX=[$48], 
SS_COUPON_AMT=[$49], SS_NET_PAID=[$50], SS_NET_PAID_INC_TAX=[$51], 
SS_NET_PROFIT=[$52], __part=[$53], S_STORE_SK=[$0], S_STORE_ID=[$1], 
S_REC_START_DATE=[$2], S_REC_END_DATE=[$3], S_CLOSED_DATE_SK=[$4], 
S_STORE_NAME=[$5], S_NUMBER_EMPLOYEES=[$6], S_FLOOR_SPACE=[$7], S_HOURS=[$8], 
S_MANAGER=[$9], S_MARKET_ID=[$10], S_GEOGRAPHY_CLASS=[$11], 
S_MARKET_DESC=[$12], S_MARKET_MANAGER=[$13], S_DIVISION_ID=[$14], 
S_DIVISION_NAME=[$15], S_COMPANY_ID=[$16], S_COMPANY_NAME=[$17], 
S_STREET_NUMBER=[$18], S_STREET_NAME=[$19], S_STREET_TYPE=[$20], 
S_SUITE_NUMBER=[$21], S_CITY=[$22], S_COUNTY=[$23], S_STATE=[$24], S_ZIP=[$25], 
S_COUNTRY=[$26], S_GMT_OFFSET=[$27], S_TAX_PERCENTAGE=[$28], __part1=[$29], 
CD_DEMO_SK=[$54], CD_GENDER=[$55], CD_MARITAL_STATUS=[$56], 
CD_EDUCATION_STATUS=[$57], CD_PURCHASE_ESTIMATE=[$58], CD_CREDIT_RATING=[$59], 
CD_DEP_COUNT=[$60], CD_DEP_EMPLOYED_COUNT=[$61], CD_DEP_COLLEGE_COUNT=[$62], 
__part0=[$63]): rowcount = 1000000.0, cumulative cost = IgniteCost 
[rowCount=7078690.5751953125, cpu=2.5502213524414062E7, memory=8725611.3515625, 
io=1.0, network=1.260006113515625E8], id = 46738
          HashJoin
            Exchange
              TableScan(table=[[PUBLIC, STORE]])
            NestedLoopJoin
              Exchange
                TableScan(
                  table=[[PUBLIC, STORE_SALES]], 
                  filters=[AND(>=(CAST($t22):DECIMAL(10, 2), 0.00), 
<=(CAST($t22):DECIMAL(10, 2), 25000.00), >=($t13, 50.00:DECIMAL(5, 2)), 
<=($t13, 200.00:DECIMAL(5, 2)))]
                ): rowcount = 62500.0, cumulative cost = IgniteCost 
[rowCount=1000000.0], id = 46732
              Exchange
                TableScan(
                  table=[[PUBLIC, CUSTOMER_DEMOGRAPHICS]], 
                  filters=[AND(OR(=($t2, _UTF-8'D'), =($t2, _UTF-8'M'), =($t2, 
_UTF-8'W')), OR(=($t2, _UTF-8'M'), =($t2, _UTF-8'W'), =($t3, _UTF-8'Primary')), 
OR(=($t2, _UTF-8'D'), =($t2, _UTF-8'M'), =($t3, _UTF-8'College')), OR(=($t2, 
_UTF-8'M'), =($t3, _UTF-8'College'), =($t3, _UTF-8'Primary')), OR(=($t3, 
_UTF-8'Unknown'), =($t2, _UTF-8'D'), =($t2, _UTF-8'W')), OR(=($t3, 
_UTF-8'Primary'), =($t3, _UTF-8'Unknown'), =($t2, _UTF-8'W')), OR(=($t3, 
_UTF-8'College'), =($t3, _UTF-8'Unknown'), =($t2, _UTF-8'D')), OR(=($t3, 
_UTF-8'College'), =($t3, _UTF-8'Primary'), =($t3, _UTF-8'Unknown')))]
                ): rowcount = 15.2587890625, cumulative cost = IgniteCost 
[rowCount=1000000.0], id = 46734
        Exchange
          TableScan(table=[[PUBLIC, DATE_DIM]])
      Exchange
        TableScan(
          table=[[PUBLIC, CUSTOMER_ADDRESS]], 
          filters=[AND(=($t10, _UTF-8'United States'), OR(IS NOT NULL($t10), 
=($t8, _UTF-8'FL'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH')), OR(IS NOT 
NULL($t10), =($t8, _UTF-8'KY'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX')), OR(IS 
NOT NULL($t10), =($t8, _UTF-8'FL'), =($t8, _UTF-8'KY'), =($t8, _UTF-8'NY'), 
=($t8, _UTF-8'OH'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX')), OR(=($t8, 
_UTF-8'GA'), =($t8, _UTF-8'MI'), =($t8, _UTF-8'NH'), IS NOT NULL($t10)), 
OR(=($t8, _UTF-8'FL'), =($t8, _UTF-8'GA'), =($t8, _UTF-8'MI'), =($t8, 
_UTF-8'NH'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH'), IS NOT NULL($t10)), 
OR(=($t8, _UTF-8'GA'), =($t8, _UTF-8'KY'), =($t8, _UTF-8'MI'), =($t8, 
_UTF-8'NH'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX'), IS NOT NULL($t10)), 
OR(=($t8, _UTF-8'FL'), =($t8, _UTF-8'GA'), =($t8, _UTF-8'KY'), =($t8, 
_UTF-8'MI'), =($t8, _UTF-8'NH'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH'), =($t8, 
_UTF-8'SD'), =($t8, _UTF-8'TX')))]
        ): rowcount = 9.1552734375, cumulative cost = IgniteCost 
[rowCount=1000000.0], id = 46742
   {code}

For CUSTOMER_DEMOGRAPHICS, it's estimated that only 15 rows out of 1kk will 
satisfy the predicate. The same with CUSTOMER_ADDRESS: 9 rows out of 1kk.

We need to improve this estimation somehow in order to help optimizer to choose 
more optimal plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (IGNITE-23969) Sql. Improve predicate selectivity estimation

Reply via email to