[ 
https://issues.apache.org/jira/browse/IMPALA-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776438#comment-17776438
 ] 

Joe McDonnell edited comment on IMPALA-12374 at 1/31/26 12:02 AM:
------------------------------------------------------------------

This shows up in TPC-H Q13, which has this condition:
{noformat}
o_comment not like '%special%requests%'{noformat}
Because of the % in the middle, none of the existing optimized paths apply to
this pattern, so it is evaluated by RE2. Replacing it with
NOT REGEXP_LIKE(o_comment, 'special.*requests') shows a speedup. Applying an
equivalent rewrite automatically, either in the frontend or in the backend C++
code, would be worth exploring.
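To make the equivalence concrete, here is a minimal sketch. It uses std::regex as a stand-in for RE2 so that it builds without external dependencies: std::regex_match plays the role of RE2::FullMatch and std::regex_search the role of RE2::PartialMatch. The function names are made up for illustration and are not Impala code. It says nothing about performance, only that the two forms accept the same strings.

```cpp
#include <regex>
#include <string>

// Full match against the regex that LIKE '%special%requests%' translates to.
// Stand-in for the RE2::FullMatch slow path.
bool FullMatchLike(const std::string& s) {
  static const std::regex re(".*special.*requests.*");
  return std::regex_match(s, re);
}

// Unanchored search with the leading and trailing .* removed.
// Stand-in for RE2::PartialMatch.
bool PartialMatchLike(const std::string& s) {
  static const std::regex re("special.*requests");
  return std::regex_search(s, re);
}

// True when both evaluation strategies give the same answer for s.
bool RewritesAgree(const std::string& s) {
  return FullMatchLike(s) == PartialMatchLike(s);
}
```

Note that the inputs here are assumed not to contain newlines. With std::regex's default grammar, "." does not match line terminators, so a real implementation would need to check how the regex library is configured for that case.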

The code is in be/src/exprs/like-predicate.cc. One way this could go is to
modify LikePredicate::LikePrepareInternal to add a new optimized case for LIKE
patterns that start and end with a wildcard but are more complex than the
already heavily optimized "substring" case. For example, the new case would
cover '%a%b%', while the substring case would continue to handle '%a%'. It
could set a new function_ to handle this case and rework ConvertLikePattern to
drop the leading/trailing .* and add the appropriate anchor where one is
needed. LikePredicate::RegexMatch would be modified to use RE2::PartialMatch
for this case.
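As a rough sketch of the pattern conversion described above (not Impala's actual ConvertLikePattern), the hypothetical helper below turns a LIKE pattern into a regex intended for an unanchored search. A leading or trailing % is dropped; where there is no % at an end, ^ or $ is added instead. It again uses std::regex in place of RE2 and, as a simplification, ignores LIKE escape characters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration only. Builds a regex meant for an
// unanchored search (PartialMatch-style) from a LIKE pattern.
// Simplification: LIKE escape sequences are not handled.
std::string LikeToSearchRegex(const std::string& like) {
  size_t begin = 0;
  size_t end = like.size();

  // A leading run of '%' needs no regex at all; without it, anchor with ^.
  bool anchor_start = true;
  while (begin < end && like[begin] == '%') {
    ++begin;
    anchor_start = false;
  }

  // Same for a trailing run of '%'. If the pattern was only '%' characters,
  // the leading loop already consumed them and no end anchor is wanted.
  bool anchor_end = (begin < end) || anchor_start;
  while (end > begin && like[end - 1] == '%') {
    --end;
    anchor_end = false;
  }

  static const std::string kMeta = ".^$*+?()[]{}|\\";
  std::string out = anchor_start ? "^" : "";
  for (size_t i = begin; i < end; ++i) {
    const char c = like[i];
    if (c == '%') {
      out += ".*";  // interior wildcard: any run of characters
    } else if (c == '_') {
      out += '.';   // single-character wildcard
    } else {
      if (kMeta.find(c) != std::string::npos) out += '\\';  // escape literals
      out += c;
    }
  }
  if (anchor_end) out += '$';
  return out;
}

// Evaluates "value LIKE like" with an unanchored search over the converted
// regex. Compiling the regex per call is only for brevity.
bool LikeViaSearch(const std::string& value, const std::string& like) {
  return std::regex_search(value, std::regex(LikeToSearchRegex(like)));
}
```

With this sketch, '%a%b%' becomes "a.*b" with no anchors, 'a%b%' becomes "^a.*b", and '%a%b' becomes "a.*b$". Patterns without any wildcard end up fully anchored, so they would gain nothing from this path and could stay on the existing optimized cases.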


> Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE 
> regex
> -----------------------------------------------------------------------------------
>
>                 Key: IMPALA-12374
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12374
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Priority: Major
>              Labels: ramp-up
>
> Abseil has some recommendations about efficiently using re2 here: 
> [https://abseil.io/fast/21]
> One recommendation it has is to avoid leading / trailing .* for FullMatch():
> {noformat}
> Using RE2::FullMatch() with leading or trailing .* is an antipattern. 
> Instead, change it to RE2::PartialMatch() and remove the .*. 
> RE2::PartialMatch() performs an unanchored search, so it is also necessary to 
> anchor the regular expression (i.e. with ^ or $) to indicate that it must 
> match at the start or end of the string.{noformat}
> For our slow path LIKE evaluation, we convert the LIKE to a regular 
> expression and use FullMatch(). Our code to generate the regular expression 
> will use leading/trailing .* and FullMatch for patterns like '%a%b%'. We 
> could try detecting these cases and switching to PartialMatch with anchors. 
> See the link for more details about how this works.


