[
https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290070#comment-16290070
]
ASF GitHub Bot commented on DRILL-5879:
---------------------------------------
GitHub user sachouche opened a pull request:
https://github.com/apache/drill/pull/1072
DRILL-5879: Improved SQL Pattern Contains Performance
**BACKGROUND**
- JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal
is to improve the "Contains" pattern performance
- [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task)
was created subsequently to avoid the ASCII / Unicode pre-processing overhead.
- This pull-request addresses the algorithmic part of this functionality
**ANALYSIS**
- Contains has O(n*m) complexity
- There are two ways to optimize the associated runtime
1) Minimize the number of instructions, pipelining, and CPU stalls
2) Use smarter algorithms (compared to the naive one); for example, the
Boyer-Moore algorithm (which is implemented by several popular Open Source
tools such as grep)
**IMPLEMENTATION**
Our approach contains both suggestions (listed in the analysis)
- Created five matchers that are chosen based on the pattern length
- The first three, are based on pattern 1) and target patterns with length
[1..4[
- The forth one, has a similar runtime then the current implementation and
targets patterns with length [4..10[
- We use the the Boyer-Moore algorithm for patterns with a length larger
than 9 bytes
NOTE - the JDK doesn't use this algorithm because of two main reasons
- Two extra arrays are necessary with a size relative to the supported
character-set and the pattern length (this would be particularly costly for
unicode as this would require 64k entries)
- Each Contains (or indexOf) invocation would require memory and
initialization overhead
- Drill doesn't have this issue as a) initialization overhead is amortized
since the pattern will be matched against many input values and b) our Contains
logic is centered around bytes so the memory overhead is around 256 integers
per fragment
**PERFORMANCE IMPROVEMENTS**
- We observe at least 25% improvement per Contains operation for matchers
with pattern length lower than 4; 100% for negative cases (as the code never
accesses a secondary for-loop)
- It was noticed the Boyer-Moor algorithm performs poorly for small
patterns as lookup accesses erase the associated optimizations; this algorithm
performs extremely well when the pattern length increases as unlike the naive
implementation it is able to use the lookup table to jump beyond the next
character (since the matching phase has already gained so insight).
- One popular introduction to this algorithm is the following use-case
- Input: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- Pattern: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaax
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachouche/drill DRILL-5879
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1072.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1072
----
commit 50ac5140420057692cac8a33f181eb16580a28e6
Author: Salim Achouche <[email protected]>
Date: 2017-12-13T22:24:45Z
DRILL-5879: Improved SQL Pattern Contains Performance
----
> Optimize "Like" operator
> ------------------------
>
> Key: DRILL-5879
> URL: https://issues.apache.org/jira/browse/DRILL-5879
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Relational Operators
> Environment: *
> Reporter: salim achouche
> Assignee: salim achouche
> Priority: Minor
> Fix For: 1.13.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like
> '%xyz%';
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're
> dealing with the same column twice
> # Optimize the "contains" for-loop
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being
> in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all
> evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and
> will be reused across other LIKE evaluations
> 2)
> * The "Contains" LIKE operation is quite expensive as the code needs to
> access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to
> optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment
> not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes : 27sec
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)