[
https://issues.apache.org/jira/browse/HIVE-29312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061673#comment-18061673
]
Wechar commented on HIVE-29312:
-------------------------------
*Background*
In our prod env, some user has the traffic dwd tables that have region, date,
hour, event_type etc partition columns, and would produce millions partitions,
the table users may scan such tables with some specific region, date and some
other partition filters.
Our backend MySQL will suffer high load when such query is frequent, so we want
to improve it.
{quote}Consider for example a table partitioned by (a,b,c,d) and partitioning
filter that says (a=1 and d=1); the generated SQL query will be PART_NAME LIKE
'a=1/%' AND PART_NAME LIKE '%/d=1' and the only way to collapse this is
PART_NAME LIKE 'a=1%d=1' but the latter is not gonna make any difference for
the index range scan.
{quote}
I have add a new bench mark test *multiPartialAnd* which is {*}`p_a`='a0' and
`p_c`='c0'{*}, and the result still shows some improvements for huge partitions:
{code:bash}
- Before
Operation Mean Med Min Max Err%
getPartitionsByFilter#multiAnd.100 4.174 4.117 3.944 5.895 5.777
getPartitionsByFilter#multiAnd.10000 6.947 6.872 6.542 8.931 4.851
getPartitionsByFilter#multiAnd.1000000 272.1 268.9 267.0 362.8 4.296
getPartitionsByFilter#multiOr.100 5.405 4.607 4.083 16.91 48.07
getPartitionsByFilter#multiOr.10000 8.038 7.871 7.595 18.75 13.77
getPartitionsByFilter#multiOr.1000000 368.5 365.0 362.6 563.6 5.565
getPartitionsByFilter#multiPartialAnd.100 4.271 4.087 3.881 11.39
22.53
getPartitionsByFilter#multiPartialAnd.10000 6.805 6.682 6.448 11.59
9.627
getPartitionsByFilter#multiPartialAnd.1000000 268.7 267.7 266.2 281.7
1.127
getPartitionsByFilter#simple.100 4.705 4.439 3.983 15.68 28.76
getPartitionsByFilter#simple.10000 6.440 6.341 6.026 9.197 5.792
getPartitionsByFilter#simple.1000000 244.9 237.8 234.8 290.2 5.477
- After
Operation Mean Med Min Max Err%
getPartitionsByFilter#multiAnd.100 4.087 4.048 3.852 4.817 4.399
getPartitionsByFilter#multiAnd.10000 6.659 6.566 6.057 8.500 7.293
getPartitionsByFilter#multiAnd.1000000 233.6 231.3 230.1 331.3 4.601
getPartitionsByFilter#multiOr.100 4.378 4.340 4.058 6.571 6.313
getPartitionsByFilter#multiOr.10000 7.793 7.738 7.541 8.429 2.133
getPartitionsByFilter#multiOr.1000000 372.3 368.3 365.9 452.6 3.291
getPartitionsByFilter#multiPartialAnd.100 4.082 4.053 3.761 4.626
4.254
getPartitionsByFilter#multiPartialAnd.10000 6.604 6.460 6.213 11.09
9.160
getPartitionsByFilter#multiPartialAnd.1000000 247.8 246.7 245.5 269.3
1.361
getPartitionsByFilter#simple.100 4.366 4.276 3.962 6.683 8.101
getPartitionsByFilter#simple.10000 6.355 6.292 6.106 8.225 4.007
getPartitionsByFilter#simple.1000000 239.9 233.6 232.0 342.6 5.944
{code}
I think the improvement is because the regex match times become less even if
the hitting index length is the same.
> Concatenate equality conditions in AND nodes
> --------------------------------------------
>
> Key: HIVE-29312
> URL: https://issues.apache.org/jira/browse/HIVE-29312
> Project: Hive
> Issue Type: Improvement
> Components: Standalone Metastore
> Affects Versions: 4.1.0
> Reporter: Wechar
> Assignee: Wechar
> Priority: Major
> Labels: pull-request-available
>
> Current partition filtert tree is a binary tree with single condition in the
> leaf node, actually the adjacent AND leaf nodes can be combined together to
> match longer index prefix. For example, a table has partition key: a, b, c,
> d, and a filter is {{a=1 and b=1 and c=1}}, when visit as a binary tree
> *AND(AND(a=1, b=1), c=1)*, the filter is:
> {code:sql}
> PART_NAME like 'a=1/%' and PART_NAME like '%/b=2/%' and PART_NAME like
> '%/c=3/%'
> {code}
> If we combine the adjacent AND conditions, the tree will be *MultiAnd(a=1,
> b=2, c=3)*, aand the filte can be:
> {code:bash}
> PART_NAME like 'a=1/b=2/c=3/%'
> {code}
> Obviously, the latter could match a longer index prefix and is more efficient.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)