[ https://issues.apache.org/jira/browse/IMPALA-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060198#comment-18060198 ]
ASF subversion and git services commented on IMPALA-12374:
----------------------------------------------------------
Commit e69012e835feb861ad6bda4b93da26829c3d2787 in impala's branch
refs/heads/master from Balazs Hevele
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e69012e83 ]
IMPALA-12374: Optimize trailing/leading % in LIKE
When converting a LIKE pattern with a leading %, a trailing %, or both
to a regular expression, use a partial match in re2 (with anchors as
necessary) on the pattern with '.*' trimmed, instead of a full match
with a leading or trailing '.*'.
Note that this optimization only concerns more complex patterns,
e.g. '%a%b%'.
Patterns where the trimmed pattern is a fixed string already use more
optimized checks, like a string search, e.g. '%abc%'.
This optimization can make LIKE matching faster, especially if the
trimmed % covers a long part of the matched string.
The performance gain is highest with both leading and trailing %,
and lowest with only a trailing %.
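The conversion described above can be sketched roughly as follows. This
is an illustration only, not the code from the commit: the helper names
LikeToPartialMatchRegex and LikeMatches are hypothetical, std::regex_search
stands in for RE2::PartialMatch, and the sketch assumes the LIKE pattern
has no escape character.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not Impala's implementation): builds a regex intended
// for an unanchored search. Leading/trailing % are trimmed instead of being
// turned into ".*"; an anchor is added on each side that had no %.
// Assumption: the LIKE pattern uses no escape character.
std::string LikeToPartialMatchRegex(const std::string& like) {
  // std::regex's '.' does not match line terminators, so use a class that
  // matches any character.
  const std::string any_char = "[\\s\\S]";
  size_t begin = 0;
  size_t end = like.size();
  while (begin < end && like[begin] == '%') ++begin;  // trim leading %
  while (end > begin && like[end - 1] == '%') --end;  // trim trailing %
  const bool leading = begin > 0;
  const bool trailing = end < like.size();
  // A pattern made only of % matches every string.
  if (begin == end && leading) return "";

  std::string re;
  if (!leading) re += '^';  // no leading %: must match at the start
  for (size_t i = begin; i < end; ++i) {
    const char c = like[i];
    if (c == '%') {
      re += any_char + "*";
    } else if (c == '_') {
      re += any_char;
    } else {
      // Escape regex metacharacters so they match literally.
      if (std::string("\\^$.|?*+()[]{}").find(c) != std::string::npos) {
        re += '\\';
      }
      re += c;
    }
  }
  if (!trailing) re += '$';  // no trailing %: must match at the end
  return re;
}

// Evaluates value LIKE like using an unanchored search on the trimmed regex.
bool LikeMatches(const std::string& value, const std::string& like) {
  return std::regex_search(value, std::regex(LikeToPartialMatchRegex(like)));
}
```

With this sketch, '%a%b%' becomes a search for a[\s\S]*b with no anchors,
'a%b%' gets only a leading ^, and '%a%b' gets only a trailing $.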
In expr-benchmark.cc, a new function BenchmarkLikeRegexp was added to
compare LIKE and regexp_like, especially in the relevant cases.
In these tests, a string of 100 characters is used to match the
trailing/leading % wildcard.
Before the change, the performance of the test cases is:
Function                  iters/ms   10%ile   50%ile   90%ile
-------------------------------------------------------------
like                                   10.7     10.8     10.9
regex                                  10.7     10.8     10.9
leading like                           18.8       19     19.1
leading regex                          68.4     69.4     69.9
trailing like                          16.2     16.3     16.6
trailing regex                         18.6     18.9     19.1
trailing leading like                  9.56      9.6     9.77
trailing leading regex                 63.5     64.3     65.1
After the change, the performance of LIKE and regexp_like is about the
same in the relevant cases:
Function                  iters/ms   10%ile   50%ile   90%ile
-------------------------------------------------------------
like                                   10.7     10.8     10.9
regex                                  10.7     10.8     10.9
leading like                           67.9     68.7     69.3
leading regex                          67.4     68.3     69.1
trailing like                          18.5     18.9       19
trailing regex                         18.7     18.9     19.1
trailing leading like                  63.1     63.9     64.6
trailing leading regex                 63.5     63.9     64.8
Testing:
- added new tests to LikePredicate in expr-test.cc to cover relevant
  cases
- added like-predicate-test.cc, which checks that optimizations are
  applied when possible
Change-Id: I37b472e056f791035d25633f17ad8a6e841cdd18
Reviewed-on: http://gerrit.cloudera.org:8080/23932
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Joe McDonnell <[email protected]>
> Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE
> regex
> -----------------------------------------------------------------------------------
>
> Key: IMPALA-12374
> URL: https://issues.apache.org/jira/browse/IMPALA-12374
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.3.0
> Reporter: Joe McDonnell
> Assignee: Balazs Hevele
> Priority: Major
> Labels: ramp-up
>
> Abseil has some recommendations about efficiently using re2 here:
> [https://abseil.io/fast/21]
> One recommendation it has is to avoid leading / trailing .* for FullMatch():
> {noformat}
> Using RE2::FullMatch() with leading or trailing .* is an antipattern.
> Instead, change it to RE2::PartialMatch() and remove the .*.
> RE2::PartialMatch() performs an unanchored search, so it is also necessary to
> anchor the regular expression (i.e. with ^ or $) to indicate that it must
> match at the start or end of the string.{noformat}
> For our slow path LIKE evaluation, we convert the LIKE to a regular
> expression and use FullMatch(). Our code to generate the regular expression
> will use leading/trailing .* and FullMatch for patterns like '%a%b%'. We
> could try detecting these cases and switching to PartialMatch with anchors.
> See the link for more details about how this works.