[
https://issues.apache.org/jira/browse/HIVE-29312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061014#comment-18061014
]
Stamatis Zampetakis commented on HIVE-29312:
--------------------------------------------
In general, it is best to avoid tables with that many partitions since they
tend to cause issues in all kind of places but let's leave this aside for the
moment.
I did some experiments with MySQL 8.4.3 and I do see the impact on the plan
with the proposed transformation. However, in order to see it I used the JSON
formatting for displaying the explain (using
{{{}explain_json_format_version=2{}}}) .
{code:sql}
explain format=json select * from PARTITIONS where TBL_ID=1616 and PART_NAME
like 'appid=0/%' and PART_NAME like '%/usageid=0/%' and PART_NAME like
'%/grassregion=AR/%' and PART_NAME like '%/grassdate=2026-02-24/%';
{code}
{code:json}
{
"query": "...",
"ranges": [
"(TBL_ID = 1616 AND 'appid=0/' <= PART_NAME <=
'appid=0/???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????')"
],
"covering": false,
"operation": "Index range scan on PARTITIONS using UNIQUEPARTITION over
(TBL_ID = 1616 AND 'appid=0/' <= PART_NAME <=
'appid=0/???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????'),
with index condition: ((`PARTITIONS`.TBL_ID = 1616) and
(`PARTITIONS`.PART_NAME like 'appid=0/%') and (`PARTITIONS`.PART_NAME like
'%/usageid=0/%') and (`PARTITIONS`.PART_NAME like '%/grassregion=AR/%') and
(`PARTITIONS`.PART_NAME like '%/grassdate=2026-02-24/%'))",
"index_name": "UNIQUEPARTITION",
"query_type": "select",
"table_name": "PARTITIONS",
"access_type": "index",
"schema_name": "hivedb",
"used_columns": [
"PART_ID",
"CREATE_TIME",
"LAST_ACCESS_TIME",
"PART_NAME",
"SD_ID",
"TBL_ID",
"WRITE_ID"
],
"estimated_rows": 1.0,
"index_access_type": "index_range_scan",
"estimated_total_cost": 0.71,
"pushed_index_condition": "((`PARTITIONS`.TBL_ID = 1616) and
(`PARTITIONS`.PART_NAME like 'appid=0/%') and (`PARTITIONS`.PART_NAME like
'%/usageid=0/%') and (`PARTITIONS`.PART_NAME like '%/grassregion=AR/%') and
(`PARTITIONS`.PART_NAME like '%/grassdate=2026-02-24/%'))"
}
{code}
{code:sql}
explain format=json select * from PARTITIONS where TBL_ID=1616 and PART_NAME
like 'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/%';
{code}
{code:json}
{
"query": "...",
"ranges": [
"(TBL_ID = 1616 AND
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/' <= PART_NAME <=
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????')"
],
"covering": false,
"operation": "Index range scan on PARTITIONS using UNIQUEPARTITION over
(TBL_ID = 1616 AND 'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/' <=
PART_NAME <=
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????'),
with index condition: ((`PARTITIONS`.TBL_ID = 1616) and
(`PARTITIONS`.PART_NAME like
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/%'))",
"index_name": "UNIQUEPARTITION",
"query_type": "select",
"table_name": "PARTITIONS",
"access_type": "index",
"schema_name": "hivedb",
"used_columns": [
"PART_ID",
"CREATE_TIME",
"LAST_ACCESS_TIME",
"PART_NAME",
"SD_ID",
"TBL_ID",
"WRITE_ID"
],
"estimated_rows": 1.0,
"index_access_type": "index_range_scan",
"estimated_total_cost": 0.71,
"pushed_index_condition": "((`PARTITIONS`.TBL_ID = 1616) and
(`PARTITIONS`.PART_NAME like
'appid=0/usageid=0/grassregion=AR/grassdate=2026-02-24/%'))"
}
{code}
It becomes evident that when there is a single LIKE predicate the range scan
boundaries are more specific so it can have an impact in performance.
After checking the plans in MySQL and Postgres it seems that the impact of the
optimization is somewhat database specific. It does have an effect on MySQL but
not sure if the same holds for Postgres or other systems.
Another aspect that affects the efficiency of the optimization is the ordering
of the partitioning columns and their respective NDVs. It seems that if the
first partitioning column has a high NDV then the transformation proposed here
is not gonna have a big impact on performance while the for low NDV it really
matters.
In addition, if the filter condition does not contain a predicate for the first
partitioning column the whole optimization will not make much of a difference
even when the underlying DBMS is MySQL. If the LIKE predicate starts with a
wildcard then the index range scan will not be usable. Similar observations
hold when the filter condition does not have predicates on every partitioning
column.
Consider for example a table partitioned by (a,b,c,d) and partitioning filter
that says (a=1 and d=1); the generated SQL query will be PART_NAME LIKE 'a=1/%'
AND PART_NAME LIKE '%/d=1' and the only way to collapse this is PART_NAME LIKE
'a=1%d=1' but the latter is not gonna make any difference for the index range
scan.
I am now starting to understand better the impact of the proposed optimization
and indeed there seem to exist some use-cases that can benefit from it.
Nevertheless, there are many preconditions that need to hold in order to
observe a noticable performance gain so I haven't fully made up my mind on how
we should proceed. My initial feeling is that this kind of optimizations are
pretty low-level and we shouldn't worry too much at the Hive/HMS level.
There is a trade-off between code simplicity/maintenance & performance so we
need to carefully weight the pros & cons of each side.
[~wechar] It would be nice to share a few insights on the use-case that
motivated this work in order to understand better how many end-users can
benefit from it.
> Concatenate equality conditions in AND nodes
> --------------------------------------------
>
> Key: HIVE-29312
> URL: https://issues.apache.org/jira/browse/HIVE-29312
> Project: Hive
> Issue Type: Improvement
> Components: Standalone Metastore
> Affects Versions: 4.1.0
> Reporter: Wechar
> Assignee: Wechar
> Priority: Major
> Labels: pull-request-available
>
> Current partition filtert tree is a binary tree with single condition in the
> leaf node, actually the adjacent AND leaf nodes can be combined together to
> match longer index prefix. For example, a table has partition key: a, b, c,
> d, and a filter is {{a=1 and b=1 and c=1}}, when visit as a binary tree
> *AND(AND(a=1, b=1), c=1)*, the filter is:
> {code:sql}
> PART_NAME like 'a=1/%' and PART_NAME like '%/b=2/%' and PART_NAME like
> '%/c=3/%'
> {code}
> If we combine the adjacent AND conditions, the tree will be *MultiAnd(a=1,
> b=2, c=3)*, aand the filte can be:
> {code:bash}
> PART_NAME like 'a=1/b=2/c=3/%'
> {code}
> Obviously, the latter could match a longer index prefix and is more efficient.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)