[jira] [Commented] (DRILL-5879) Optimize "Like" operator

ASF GitHub Bot (JIRA) Wed, 18 Oct 2017 17:46:50 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210400#comment-16210400
 ]


ASF GitHub Bot commented on DRILL-5879:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1001#discussion_r145577894
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -17,36 +17,133 @@
      */
     package org.apache.drill.exec.expr.fn.impl;
     
    -public class SqlPatternContainsMatcher implements SqlPatternMatcher {
    +public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
       final String patternString;
       CharSequence charSequenceWrapper;
       final int patternLength;
     
       public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) {
    -    this.patternString = patternString;
    +    this.patternString       = patternString;
         this.charSequenceWrapper = charSequenceWrapper;
    -    patternLength = patternString.length();
    +    patternLength            = patternString.length();
       }
     
       @Override
    -  public int match() {
    -    final int txtLength = charSequenceWrapper.length();
    -    int patternIndex = 0;
    -    int txtIndex = 0;
    +  public final int match() {
    +    // The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize
    +    // the generate code.
    +    if (patternLength == 1) {
    +      return match_1();
    +    } else if (patternLength == 2) {
    +      return match_2();
    +    } else if (patternLength == 3) {
    +      return match_3();
    +    } else {
    +      return match_N();
    +    }
    +  }
    +
    +  private final int match_1() {
    +    final CharSequence sequenceWrapper = charSequenceWrapper;
    +    final int lengthToProcess          = sequenceWrapper.length();
    +    final char first_patt_char         = patternString.charAt(0);
    +
    +    // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +    // so, we can just directly compare.
    +    for (int idx = 0; idx < lengthToProcess; idx++) {
    +      char input_char = sequenceWrapper.charAt(idx);
    +
    +      if (first_patt_char != input_char) {
    +        continue;
    +      }
    +      return 1;
    +    }
    +    return 0;
    +  }
    +
    +  private final int match_2() {
    +    final CharSequence sequenceWrapper = charSequenceWrapper;
    +    final int lengthToProcess          = sequenceWrapper.length() - 1;
    +    final char first_patt_char         = patternString.charAt(0);
    +
    +    // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +    // so, we can just directly compare.
    +    for (int idx = 0; idx < lengthToProcess; idx++) {
    +      char input_char = sequenceWrapper.charAt(idx);
    +
    +      if (first_patt_char != input_char) {
    +        continue;
    +      } else {
    +        char ch2_1 = sequenceWrapper.charAt(idx+1);
    +        char ch2_2 = patternString.charAt(1);
    +
    +        if (ch2_1 == ch2_2) {
    +          return 1;
    +        }
    +      }
    +    }
    +    return 0;
    +  }
    +
    +  private final int match_3() {
    +    final CharSequence sequenceWrapper = charSequenceWrapper;
    +    final int lengthToProcess          = sequenceWrapper.length() -2;
    +    final char first_patt_char         = patternString.charAt(0);
     
         // simplePattern string has meta characters i.e % and _ and escape characters removed.
         // so, we can just directly compare.
    -    while (patternIndex < patternLength && txtIndex < txtLength) {
    -      if (patternString.charAt(patternIndex) != charSequenceWrapper.charAt(txtIndex)) {
    -        // Go back if there is no match
    -        txtIndex = txtIndex - patternIndex;
    -        patternIndex = 0;
    +    for (int idx = 0; idx < lengthToProcess; idx++) {
    +      char input_char = sequenceWrapper.charAt(idx);
    +
    +      if (first_patt_char != input_char) {
    +        continue;
           } else {
    -        patternIndex++;
    +        char ch2_1 = sequenceWrapper.charAt(idx+1);
    +        char ch2_2 = patternString.charAt(1);
    +        char ch3_1 = sequenceWrapper.charAt(idx+2);
    +        char ch3_2 = patternString.charAt(2);
    +
    +        if (ch2_1 == ch2_2 && ch3_1 == ch3_2) {
    +          return 1;
    +        }
           }
    -      txtIndex++;
    +    }
    +    return 0;
    +  }
    +
    +  private final int match_N() {
    +
    +    if (patternLength == 0) {
    --- End diff --
    
    Can't this check be handled in the pattern parser stage? We should not be
    calling this function at all when the pattern has zero length.
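
To illustrate the suggestion, here is a minimal sketch of choosing the matcher once, when the pattern is parsed, so that the per-row code never sees a zero-length pattern. All names below (`ContainsMatcherSketch`, `Matcher`, `forContainsPattern`, `ALWAYS_MATCH`) are hypothetical and not taken from the Drill code base, and the sketch assumes that an empty "contains" pattern should match every input.

```java
// Hypothetical sketch: none of these names come from the Drill code base.
public final class ContainsMatcherSketch {

  /** Minimal stand-in for a matcher: returns 1 on a match, 0 otherwise. */
  public interface Matcher {
    int match(CharSequence input);
  }

  /** Assumption: an empty "contains" pattern matches every input. */
  public static final Matcher ALWAYS_MATCH = input -> 1;

  /** Picks the matcher once, at pattern-parse time, instead of per row. */
  public static Matcher forContainsPattern(String pattern) {
    if (pattern.isEmpty()) {
      // The zero-length case is resolved here, so no per-row check is needed.
      return ALWAYS_MATCH;
    }
    if (pattern.length() == 1) {
      final char only = pattern.charAt(0);
      return input -> {
        for (int idx = 0; idx < input.length(); idx++) {
          if (input.charAt(idx) == only) {
            return 1;
          }
        }
        return 0;
      };
    }
    // General case: a naive scan, which is enough for a sketch.
    return input -> {
      final int lastStart = input.length() - pattern.length();
      for (int start = 0; start <= lastStart; start++) {
        int offset = 0;
        while (offset < pattern.length()
            && input.charAt(start + offset) == pattern.charAt(offset)) {
          offset++;
        }
        if (offset == pattern.length()) {
          return 1;
        }
      }
      return 0;
    };
  }
}
```

With this shape, the length dispatch in `match()` and the zero-length test would both run once per query rather than once per value.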


> Optimize "Like" operator
> ------------------------
>
>                 Key: DRILL-5879
>                 URL: https://issues.apache.org/jira/browse/DRILL-5879
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>         Environment: * 
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
>
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
> # Optimize the "contains" for-loop
>
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
> 2)
> * The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
>
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes: 27sec
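
The asciiMode caching described in item 1 of the issue can be sketched roughly as follows. This is an illustration only: `AsciiModeSketch`, `Holder`, and the `scans` counter are hypothetical stand-ins rather than Drill's VarCharHolder, and the mapping of 1 to ASCII and 0 to non-ASCII is an assumption based on the order used in the description.

```java
// Hypothetical sketch of the caching idea; not Drill's actual VarCharHolder.
public final class AsciiModeSketch {

  /** Stand-in for a value holder shared by all evaluations of one column value. */
  public static final class Holder {
    public final byte[] bytes;
    // Assumed encoding of the flag: -1 = not known yet, 1 = ASCII, 0 = not ASCII.
    public int asciiMode = -1;

    public Holder(byte[] bytes) {
      this.bytes = bytes;
    }
  }

  /** Counts full scans of the input; present only to make the saving visible. */
  public static int scans = 0;

  /** Scans the bytes on the first call only; later calls reuse the cached flag. */
  public static boolean isAscii(Holder holder) {
    if (holder.asciiMode == -1) {
      scans++;
      int mode = 1;
      for (byte b : holder.bytes) {
        if (b < 0) { // high bit set, so this byte is not 7-bit ASCII
          mode = 0;
          break;
        }
      }
      holder.asciiMode = mode;
    }
    return holder.asciiMode == 1;
  }
}
```

In a query with two LIKE predicates on the same column, the second predicate would then read the cached flag instead of rescanning the value.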



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
