Konstantin Orlov created IGNITE-23969:
-----------------------------------------
Summary: Sql. Improve predicate selectivity estimation
Key: IGNITE-23969
URL: https://issues.apache.org/jira/browse/IGNITE-23969
Project: Ignite
Issue Type: Improvement
Components: sql
Reporter: Konstantin Orlov
Let's take a look at query 48 plan from TPS-DS suite:
{code:java}
-- start query 1 in stream 0 using template query48.tpl and seed 622697896
select sum (ss_quantity)
from store_sales, store, customer_demographics, customer_address, date_dim
where s_store_sk = ss_store_sk
and ss_sold_date_sk = d_date_sk and d_year = 1998
and
(
(
cd_demo_sk = ss_cdemo_sk
and
cd_marital_status = 'M'
and
cd_education_status = 'Unknown'
and
ss_sales_price between 100.00 and 150.00
)
or
(
cd_demo_sk = ss_cdemo_sk
and
cd_marital_status = 'W'
and
cd_education_status = 'College'
and
ss_sales_price between 50.00 and 100.00
)
or
(
cd_demo_sk = ss_cdemo_sk
and
cd_marital_status = 'D'
and
cd_education_status = 'Primary'
and
ss_sales_price between 150.00 and 200.00
)
)
and
(
(
ss_addr_sk = ca_address_sk
and
ca_country = 'United States'
and
ca_state in ('MI', 'GA', 'NH')
and ss_net_profit between 0 and 2000
)
or
(ss_addr_sk = ca_address_sk
and
ca_country = 'United States'
and
ca_state in ('TX', 'KY', 'SD')
and ss_net_profit between 150 and 3000
)
or
(ss_addr_sk = ca_address_sk
and
ca_country = 'United States'
and
ca_state in ('NY', 'OH', 'FL')
and ss_net_profit between 50 and 25000
)
)
;
-- end query 1 in stream 0 using template query48.tpl
ColocatedHashAggregate
Project
NestedLoopJoin
HashJoin
Project(SS_SOLD_DATE_SK=[$30], SS_SOLD_TIME_SK=[$31], SS_ITEM_SK=[$32],
SS_CUSTOMER_SK=[$33], SS_CDEMO_SK=[$34], SS_HDEMO_SK=[$35], SS_ADDR_SK=[$36],
SS_STORE_SK=[$37], SS_PROMO_SK=[$38], SS_TICKET_NUMBER=[$39],
SS_QUANTITY=[$40], SS_WHOLESALE_COST=[$41], SS_LIST_PRICE=[$42],
SS_SALES_PRICE=[$43], SS_EXT_DISCOUNT_AMT=[$44], SS_EXT_SALES_PRICE=[$45],
SS_EXT_WHOLESALE_COST=[$46], SS_EXT_LIST_PRICE=[$47], SS_EXT_TAX=[$48],
SS_COUPON_AMT=[$49], SS_NET_PAID=[$50], SS_NET_PAID_INC_TAX=[$51],
SS_NET_PROFIT=[$52], __part=[$53], S_STORE_SK=[$0], S_STORE_ID=[$1],
S_REC_START_DATE=[$2], S_REC_END_DATE=[$3], S_CLOSED_DATE_SK=[$4],
S_STORE_NAME=[$5], S_NUMBER_EMPLOYEES=[$6], S_FLOOR_SPACE=[$7], S_HOURS=[$8],
S_MANAGER=[$9], S_MARKET_ID=[$10], S_GEOGRAPHY_CLASS=[$11],
S_MARKET_DESC=[$12], S_MARKET_MANAGER=[$13], S_DIVISION_ID=[$14],
S_DIVISION_NAME=[$15], S_COMPANY_ID=[$16], S_COMPANY_NAME=[$17],
S_STREET_NUMBER=[$18], S_STREET_NAME=[$19], S_STREET_TYPE=[$20],
S_SUITE_NUMBER=[$21], S_CITY=[$22], S_COUNTY=[$23], S_STATE=[$24], S_ZIP=[$25],
S_COUNTRY=[$26], S_GMT_OFFSET=[$27], S_TAX_PERCENTAGE=[$28], __part1=[$29],
CD_DEMO_SK=[$54], CD_GENDER=[$55], CD_MARITAL_STATUS=[$56],
CD_EDUCATION_STATUS=[$57], CD_PURCHASE_ESTIMATE=[$58], CD_CREDIT_RATING=[$59],
CD_DEP_COUNT=[$60], CD_DEP_EMPLOYED_COUNT=[$61], CD_DEP_COLLEGE_COUNT=[$62],
__part0=[$63]): rowcount = 1000000.0, cumulative cost = IgniteCost
[rowCount=7078690.5751953125, cpu=2.5502213524414062E7, memory=8725611.3515625,
io=1.0, network=1.260006113515625E8], id = 46738
HashJoin
Exchange
TableScan(table=[[PUBLIC, STORE]])
NestedLoopJoin
Exchange
TableScan(
table=[[PUBLIC, STORE_SALES]],
filters=[AND(>=(CAST($t22):DECIMAL(10, 2), 0.00),
<=(CAST($t22):DECIMAL(10, 2), 25000.00), >=($t13, 50.00:DECIMAL(5, 2)),
<=($t13, 200.00:DECIMAL(5, 2)))]
): rowcount = 62500.0, cumulative cost = IgniteCost
[rowCount=1000000.0], id = 46732
Exchange
TableScan(
table=[[PUBLIC, CUSTOMER_DEMOGRAPHICS]],
filters=[AND(OR(=($t2, _UTF-8'D'), =($t2, _UTF-8'M'), =($t2,
_UTF-8'W')), OR(=($t2, _UTF-8'M'), =($t2, _UTF-8'W'), =($t3, _UTF-8'Primary')),
OR(=($t2, _UTF-8'D'), =($t2, _UTF-8'M'), =($t3, _UTF-8'College')), OR(=($t2,
_UTF-8'M'), =($t3, _UTF-8'College'), =($t3, _UTF-8'Primary')), OR(=($t3,
_UTF-8'Unknown'), =($t2, _UTF-8'D'), =($t2, _UTF-8'W')), OR(=($t3,
_UTF-8'Primary'), =($t3, _UTF-8'Unknown'), =($t2, _UTF-8'W')), OR(=($t3,
_UTF-8'College'), =($t3, _UTF-8'Unknown'), =($t2, _UTF-8'D')), OR(=($t3,
_UTF-8'College'), =($t3, _UTF-8'Primary'), =($t3, _UTF-8'Unknown')))]
): rowcount = 15.2587890625, cumulative cost = IgniteCost
[rowCount=1000000.0], id = 46734
Exchange
TableScan(table=[[PUBLIC, DATE_DIM]])
Exchange
TableScan(
table=[[PUBLIC, CUSTOMER_ADDRESS]],
filters=[AND(=($t10, _UTF-8'United States'), OR(IS NOT NULL($t10),
=($t8, _UTF-8'FL'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH')), OR(IS NOT
NULL($t10), =($t8, _UTF-8'KY'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX')), OR(IS
NOT NULL($t10), =($t8, _UTF-8'FL'), =($t8, _UTF-8'KY'), =($t8, _UTF-8'NY'),
=($t8, _UTF-8'OH'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX')), OR(=($t8,
_UTF-8'GA'), =($t8, _UTF-8'MI'), =($t8, _UTF-8'NH'), IS NOT NULL($t10)),
OR(=($t8, _UTF-8'FL'), =($t8, _UTF-8'GA'), =($t8, _UTF-8'MI'), =($t8,
_UTF-8'NH'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH'), IS NOT NULL($t10)),
OR(=($t8, _UTF-8'GA'), =($t8, _UTF-8'KY'), =($t8, _UTF-8'MI'), =($t8,
_UTF-8'NH'), =($t8, _UTF-8'SD'), =($t8, _UTF-8'TX'), IS NOT NULL($t10)),
OR(=($t8, _UTF-8'FL'), =($t8, _UTF-8'GA'), =($t8, _UTF-8'KY'), =($t8,
_UTF-8'MI'), =($t8, _UTF-8'NH'), =($t8, _UTF-8'NY'), =($t8, _UTF-8'OH'), =($t8,
_UTF-8'SD'), =($t8, _UTF-8'TX')))]
): rowcount = 9.1552734375, cumulative cost = IgniteCost
[rowCount=1000000.0], id = 46742
{code}
For CUSTOMER_DEMOGRAPHICS, it's estimated that only 15 rows out of 1kk will
satisfy the predicate. The same with CUSTOMER_ADDRESS: 9 rows out of 1kk.
We need to improve this estimation somehow in order to help optimizer to choose
more optimal plans.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)