[ 
https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290070#comment-16290070
 ] 

ASF GitHub Bot commented on DRILL-5879:
---------------------------------------

GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1072

    DRILL-5879: Improved SQL Pattern Contains Performance

    **BACKGROUND**
    - JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal 
is to improve the "Contains" pattern performance
    - [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task) 
was created subsequently to avoid the ASCII / Unicode pre-processing overhead.
    - This pull-request addresses the algorithmic part of this functionality
    
    **ANALYSIS**
    - Contains has O(n*m) complexity
    - There are two ways to optimize the associated runtime
    1) Minimize the number of instructions, pipelining, and CPU stalls
    2) Use smarter algorithms (compared to the naive one); for example, the 
Boyer-Moore algorithm (which is implemented by several popular Open Source 
tools such as grep)
    
    **IMPLEMENTATION**
    Our approach contains both suggestions (listed in the analysis)
    - Created five matchers that are chosen based on the pattern length
    - The first three, are based on pattern 1) and target patterns with length 
[1..4[ 
    - The forth one, has a similar runtime then the current implementation and 
targets patterns with length [4..10[
    - We use the the Boyer-Moore algorithm for patterns with a length larger 
than 9 bytes
    NOTE - the JDK doesn't use this algorithm because of two main  reasons
    - Two extra arrays are necessary with a size relative to the supported 
character-set and the pattern length (this would be particularly costly for 
unicode as this would require 64k entries)
    - Each Contains (or indexOf) invocation would require memory and 
initialization overhead
    - Drill doesn't have this issue as a) initialization overhead is amortized 
since the pattern will be matched against many input values and b) our Contains 
logic is centered around bytes so the memory overhead is around 256 integers 
per fragment
    
    **PERFORMANCE IMPROVEMENTS**
    - We observe at least 25% improvement per Contains operation for matchers 
with pattern length lower than 4; 100% for negative cases (as the code never 
accesses a secondary for-loop)
    - It was noticed the Boyer-Moor algorithm performs poorly for small 
patterns as lookup accesses erase the associated optimizations; this algorithm 
performs extremely well when the pattern length increases as unlike the naive 
implementation it is able to use the lookup table to jump beyond the next 
character (since the matching phase has already gained so insight).
    - One popular introduction to this algorithm is the following use-case
    - Input: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    - Pattern: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaax
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-5879

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1072
    
----
commit 50ac5140420057692cac8a33f181eb16580a28e6
Author: Salim Achouche <[email protected]>
Date:   2017-12-13T22:24:45Z

    DRILL-5879: Improved SQL Pattern Contains Performance

----


> Optimize "Like" operator
> ------------------------
>
>                 Key: DRILL-5879
>                 URL: https://issues.apache.org/jira/browse/DRILL-5879
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>         Environment: * 
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Minor
>             Fix For: 1.13.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like 
> '%xyz%';
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're 
> dealing with the same column twice
> # Optimize the "contains" for-loop 
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being 
> in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all 
> evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and 
> will be reused across other LIKE evaluations
> 2) 
> * The "Contains" LIKE operation is quite expensive as the code needs to 
> access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to 
> optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment 
> not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes    : 27sec



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to