[ 
https://issues.apache.org/jira/browse/IGNITE-23969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evgeny Stanilovsky reassigned IGNITE-23969:
-------------------------------------------

    Assignee: Evgeny Stanilovsky

> Sql. Improve predicate selectivity estimation
> ---------------------------------------------
>
>                 Key: IGNITE-23969
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23969
>             Project: Ignite
>          Issue Type: Improvement
>          Components: sql
>            Reporter: Konstantin Orlov
>            Assignee: Evgeny Stanilovsky
>            Priority: Major
>              Labels: ignite-3
>
> Let's take a look at query 48 plan from TPS-DS suite:
> {code:java}
> -- start query 1 in stream 0 using template query48.tpl and seed 622697896
> select sum (ss_quantity)
>  from store_sales, store, customer_demographics, customer_address, date_dim
>  where s_store_sk = ss_store_sk
>  and  ss_sold_date_sk = d_date_sk and d_year = 1998
>  and  
>  (
>   (
>    cd_demo_sk = ss_cdemo_sk
>    and 
>    cd_marital_status = 'M'
>    and 
>    cd_education_status = 'Unknown'
>    and 
>    ss_sales_price between 100.00 and 150.00  
>    )
>  or
>   (
>   cd_demo_sk = ss_cdemo_sk
>    and 
>    cd_marital_status = 'W'
>    and 
>    cd_education_status = 'College'
>    and 
>    ss_sales_price between 50.00 and 100.00   
>   )
>  or 
>  (
>   cd_demo_sk = ss_cdemo_sk
>   and 
>    cd_marital_status = 'D'
>    and 
>    cd_education_status = 'Primary'
>    and 
>    ss_sales_price between 150.00 and 200.00  
>  )
>  )
>  and
>  (
>   (
>   ss_addr_sk = ca_address_sk
>   and
>   ca_country = 'United States'
>   and
>   ca_state in ('MI', 'GA', 'NH')
>   and ss_net_profit between 0 and 2000  
>   )
>  or
>   (ss_addr_sk = ca_address_sk
>   and
>   ca_country = 'United States'
>   and
>   ca_state in ('TX', 'KY', 'SD')
>   and ss_net_profit between 150 and 3000 
>   )
>  or
>   (ss_addr_sk = ca_address_sk
>   and
>   ca_country = 'United States'
>   and
>   ca_state in ('NY', 'OH', 'FL')
>   and ss_net_profit between 50 and 25000 
>   )
>  )
> ;
> -- end query 1 in stream 0 using template query48.tpl
> ColocatedHashAggregate
>   Project
>     NestedLoopJoin
>       HashJoin
>         Project(SS_SOLD_DATE_SK=[$30], SS_SOLD_TIME_SK=[$31], 
> SS_ITEM_SK=[$32], SS_CUSTOMER_SK=[$33], SS_CDEMO_SK=[$34], SS_HDEMO_SK=[$35], 
> SS_ADDR_SK=[$36], SS_STORE_SK=[$37], SS_PROMO_SK=[$38], 
> SS_TICKET_NUMBER=[$39], SS_QUANTITY=[$40], SS_WHOLESALE_COST=[$41], 
> SS_LIST_PRICE=[$42], SS_SALES_PRICE=[$43], SS_EXT_DISCOUNT_AMT=[$44], 
> SS_EXT_SALES_PRICE=[$45], SS_EXT_WHOLESALE_COST=[$46], 
> SS_EXT_LIST_PRICE=[$47], SS_EXT_TAX=[$48], SS_COUPON_AMT=[$49], 
> SS_NET_PAID=[$50], SS_NET_PAID_INC_TAX=[$51], SS_NET_PROFIT=[$52], 
> __part=[$53], S_STORE_SK=[$0], S_STORE_ID=[$1], S_REC_START_DATE=[$2], 
> S_REC_END_DATE=[$3], S_CLOSED_DATE_SK=[$4], S_STORE_NAME=[$5], 
> S_NUMBER_EMPLOYEES=[$6], S_FLOOR_SPACE=[$7], S_HOURS=[$8], S_MANAGER=[$9], 
> S_MARKET_ID=[$10], S_GEOGRAPHY_CLASS=[$11], S_MARKET_DESC=[$12], 
> S_MARKET_MANAGER=[$13], S_DIVISION_ID=[$14], S_DIVISION_NAME=[$15], 
> S_COMPANY_ID=[$16], S_COMPANY_NAME=[$17], S_STREET_NUMBER=[$18], 
> S_STREET_NAME=[$19], S_STREET_TYPE=[$20], S_SUITE_NUMBER=[$21], S_CITY=[$22], 
> S_COUNTY=[$23], S_STATE=[$24], S_ZIP=[$25], S_COUNTRY=[$26], 
> S_GMT_OFFSET=[$27], S_TAX_PERCENTAGE=[$28], __part1=[$29], CD_DEMO_SK=[$54], 
> CD_GENDER=[$55], CD_MARITAL_STATUS=[$56], CD_EDUCATION_STATUS=[$57], 
> CD_PURCHASE_ESTIMATE=[$58], CD_CREDIT_RATING=[$59], CD_DEP_COUNT=[$60], 
> CD_DEP_EMPLOYED_COUNT=[$61], CD_DEP_COLLEGE_COUNT=[$62], __part0=[$63]): 
> rowcount = 1000000.0, cumulative cost = IgniteCost 
> [rowCount=7078690.5751953125, cpu=2.5502213524414062E7, 
> memory=8725611.3515625, io=1.0, network=1.260006113515625E8], id = 46738
>           HashJoin
>             Exchange
>               TableScan(table=[[PUBLIC, STORE]])
>             NestedLoopJoin
>               Exchange
>                 TableScan(
>                   table=[[PUBLIC, STORE_SALES]], 
>                   filters=[AND(>=(CAST($t22):DECIMAL(10, 2), 0.00), 
> <=(CAST($t22):DECIMAL(10, 2), 25000.00), >=($t13, 50.00:DECIMAL(5, 2)), 
> <=($t13, 200.00:DECIMAL(5, 2)))]
>                 ): rowcount = 62500.0, cumulative cost = IgniteCost 
> [rowCount=1000000.0], id = 46732
>               Exchange
>                 TableScan(
>                   table=[[PUBLIC, CUSTOMER_DEMOGRAPHICS]], 
>                   filters=[AND(OR(=($t2, _UTF-8'D'), =($t2, _UTF-8'M'), 
> =($t2, _UTF-8'W')), OR(=($t2, _UTF-8'M'), =($t2, _UTF-8'W'), =($t3, 
> _UTF-8'Primary')), OR(=($t2, _UTF-8'D'), =($t2, _UTF-8'M'), =($t3, 
> _UTF-8'College')), OR(=($t2, _UTF-8'M'), =($t3, _UTF-8'College'), =($t3, 
> _UTF-8'Primary')), OR(=($t3, _UTF-8'Unknown'), =($t2, _UTF-8'D'), =($t2, 
> _UTF-8'W')), OR(=($t3, _UTF-8'Primary'), =($t3, _UTF-8'Unknown'), =($t2, 
> _UTF-8'W')), OR(=($t3, _UTF-8'College'), =($t3, _UTF-8'Unknown'), =($t2, 
> _UTF-8'D')), OR(=($t3, _UTF-8'College'), =($t3, _UTF-8'Primary'), =($t3, 
> _UTF-8'Unknown')))]
>                 ): rowcount = 15.2587890625, cumulative cost = IgniteCost 
> [rowCount=1000000.0], id = 46734
>         Exchange
>           TableScan(table=[[PUBLIC, DATE_DIM]])
>       Exchange
>         TableScan(
>           table=[[PUBLIC, CUSTOMER_ADDRESS]], 
>           filters=[AND(=($t10, _UTF-8'United States'), OR(IS NOT NULL($t10), 
> =($t8, _UTF-8'FL'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH')), OR(IS NOT 
> NULL($t10), =($t8, _UTF-8'KY'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX')), 
> OR(IS NOT NULL($t10), =($t8, _UTF-8'FL'), =($t8, _UTF-8'KY'), =($t8, 
> _UTF-8'NY'), =($t8, _UTF-8'OH'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX')), 
> OR(=($t8, _UTF-8'GA'), =($t8, _UTF-8'MI'), =($t8, _UTF-8'NH'), IS NOT 
> NULL($t10)), OR(=($t8, _UTF-8'FL'), =($t8, _UTF-8'GA'), =($t8, _UTF-8'MI'), 
> =($t8, _UTF-8'NH'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH'), IS NOT 
> NULL($t10)), OR(=($t8, _UTF-8'GA'), =($t8, _UTF-8'KY'), =($t8, _UTF-8'MI'), 
> =($t8, _UTF-8'NH'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX'), IS NOT 
> NULL($t10)), OR(=($t8, _UTF-8'FL'), =($t8, _UTF-8'GA'), =($t8, _UTF-8'KY'), 
> =($t8, _UTF-8'MI'), =($t8, _UTF-8'NH'), =($t8, _UTF-8'NY'), =($t8, 
> _UTF-8'OH'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX')))]
>         ): rowcount = 9.1552734375, cumulative cost = IgniteCost 
> [rowCount=1000000.0], id = 46742
>    {code}
> For CUSTOMER_DEMOGRAPHICS, it's estimated that only 15 rows out of 1kk will 
> satisfy the predicate. The same with CUSTOMER_ADDRESS: 9 rows out of 1kk.
> We need to improve this estimation somehow in order to help optimizer to 
> choose more optimal plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to