[ 
https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210398#comment-16210398
 ] 

ASF GitHub Bot commented on DRILL-5879:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1001#discussion_r145577808
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -17,36 +17,133 @@
      */
     package org.apache.drill.exec.expr.fn.impl;
     
    -public class SqlPatternContainsMatcher implements SqlPatternMatcher {
    +public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
       final String patternString;
       CharSequence charSequenceWrapper;
       final int patternLength;
     
       public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) {
    -    this.patternString = patternString;
    +    this.patternString       = patternString;
         this.charSequenceWrapper = charSequenceWrapper;
    -    patternLength = patternString.length();
    +    patternLength            = patternString.length();
       }
     
       @Override
    -  public int match() {
    -    final int txtLength = charSequenceWrapper.length();
    -    int patternIndex = 0;
    -    int txtIndex = 0;
    +  public final int match() {
    +    // The idea is to write loops with simple condition checks to allow the Java Hotspot vectorize
    +    // the generate code.
    +    if (patternLength == 1) {
    +      return match_1();
    +    } else if (patternLength == 2) {
    +      return match_2();
    +    } else if (patternLength == 3) {
    +      return match_3();
    +    } else {
    +      return match_N();
    +    }
    +  }
    +
    +  private final int match_1() {
    +    final CharSequence sequenceWrapper = charSequenceWrapper;
    +    final int lengthToProcess          = sequenceWrapper.length();
    +    final char first_patt_char         = patternString.charAt(0);
    +
    +    // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +    // so, we can just directly compare.
    +    for (int idx = 0; idx < lengthToProcess; idx++) {
    +      char input_char = sequenceWrapper.charAt(idx);
    +
    +      if (first_patt_char != input_char) {
    +        continue;
    +      }
    +      return 1;
    +    }
    +    return 0;
    +  }
    +
    +  private final int match_2() {
    +    final CharSequence sequenceWrapper = charSequenceWrapper;
    +    final int lengthToProcess          = sequenceWrapper.length() - 1;
    +    final char first_patt_char         = patternString.charAt(0);
    +
    +    // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +    // so, we can just directly compare.
    +    for (int idx = 0; idx < lengthToProcess; idx++) {
    +      char input_char = sequenceWrapper.charAt(idx);
    +
    +      if (first_patt_char != input_char) {
    +        continue;
    +      } else {
    +        char ch2_1 = sequenceWrapper.charAt(idx+1);
    +        char ch2_2 = patternString.charAt(1);
    --- End diff --
    
    We want speed. Instead of fetching the second pattern character on every first-character hit, would it be better to fetch it once up front? I suppose that depends on the hit rate. The average hit rate may be about 2% (roughly 1 in 64). So if our input is about 64 characters or shorter, we'll see, on average, at most one hit, and we pay the second-character cost once. At 128 characters or more, we'll pay that cost two or more times. But maybe the JVM can optimize away the second and subsequent accesses?
    
    Actually, let's take a step back. The pattern is fixed. We already parsed the pattern to decide to use this particular class. Should we instead create separate 1-char, 2-char, and n-char matcher classes, so that we fetch the second character (for the 2-char case) only once and eliminate the extra per-value if-check?
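
    To make the suggestion concrete, here is a minimal sketch of what I have in mind. All names below (`ContainsMatcher`, `OneCharMatcher`, `TwoCharMatcher`, `NCharMatcher`, `ContainsMatchers.forPattern`) are hypothetical and are not part of the existing `SqlPatternMatcher` code. For simplicity the sketch passes the input to `match()` rather than holding a `charSequenceWrapper` field as the PR does. The point is only that the pattern characters become final fields, read once in the constructor, and that the length check happens once per pattern instead of once per value.

```java
// Hypothetical sketch; not the Drill SqlPatternMatcher API.
interface ContainsMatcher {
  // Returns 1 if the pattern occurs anywhere in the input, 0 otherwise.
  int match(CharSequence input);
}

final class OneCharMatcher implements ContainsMatcher {
  private final char c0; // read from the pattern once, at construction

  OneCharMatcher(String pattern) {
    this.c0 = pattern.charAt(0);
  }

  @Override
  public int match(CharSequence input) {
    final int length = input.length();
    for (int idx = 0; idx < length; idx++) {
      if (input.charAt(idx) == c0) {
        return 1;
      }
    }
    return 0;
  }
}

final class TwoCharMatcher implements ContainsMatcher {
  private final char c0;
  private final char c1; // second character is fetched only once

  TwoCharMatcher(String pattern) {
    this.c0 = pattern.charAt(0);
    this.c1 = pattern.charAt(1);
  }

  @Override
  public int match(CharSequence input) {
    final int lengthToProcess = input.length() - 1;
    for (int idx = 0; idx < lengthToProcess; idx++) {
      if (input.charAt(idx) == c0 && input.charAt(idx + 1) == c1) {
        return 1;
      }
    }
    return 0;
  }
}

final class NCharMatcher implements ContainsMatcher {
  private final String pattern;

  NCharMatcher(String pattern) {
    this.pattern = pattern;
  }

  @Override
  public int match(CharSequence input) {
    final int patternLength = pattern.length();
    final int lengthToProcess = input.length() - patternLength + 1;
    for (int idx = 0; idx < lengthToProcess; idx++) {
      int p = 0;
      while (p < patternLength && input.charAt(idx + p) == pattern.charAt(p)) {
        p++;
      }
      if (p == patternLength) {
        return 1;
      }
    }
    return 0;
  }
}

final class ContainsMatchers {
  // Chooses the implementation once, when the pattern is parsed.
  static ContainsMatcher forPattern(String pattern) {
    switch (pattern.length()) {
      case 1:  return new OneCharMatcher(pattern);
      case 2:  return new TwoCharMatcher(pattern);
      default: return new NCharMatcher(pattern);
    }
  }
}
```

    With this shape, the caller would do something like `ContainsMatchers.forPattern("th")` once per query and then call `match(...)` per value. Whether this actually beats the single-class version would have to be measured.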


> Optimize "Like" operator
> ------------------------
>
>                 Key: DRILL-5879
>                 URL: https://issues.apache.org/jira/browse/DRILL-5879
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>         Environment: * 
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like 
> '%xyz%';
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're 
> dealing with the same column twice
> # Optimize the "contains" for-loop 
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being 
> in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all 
> evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and 
> will be reused across other LIKE evaluations
> 2) 
> * The "Contains" LIKE operation is quite expensive as the code needs to 
> access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to 
> optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment 
> not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33 sec
> * After changes: 27 sec
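
The asciiMode caching described under item 1 of the implementation details could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual change: `TextHolder` is a hypothetical stand-in for `VarCharHolder`, `AsciiCheck.isAscii` is a hypothetical helper, and the `scans` counter exists only to show that the full scan runs once per holder.

```java
// Hypothetical stand-in for a VarChar value holder; not Drill's VarCharHolder.
final class TextHolder {
  static final int ASCII_UNKNOWN = -1;

  final byte[] bytes;
  // -1: not yet known, 1: all bytes are ASCII, 0: at least one non-ASCII byte
  int asciiMode = ASCII_UNKNOWN;

  TextHolder(byte[] bytes) {
    this.bytes = bytes;
  }
}

final class AsciiCheck {
  // Counts full scans of the input; for demonstration only.
  static int scans = 0;

  static boolean isAscii(TextHolder holder) {
    if (holder.asciiMode == TextHolder.ASCII_UNKNOWN) {
      scans++;
      int mode = 1;
      for (byte b : holder.bytes) {
        if (b < 0) { // high bit set, so not a 7-bit ASCII byte
          mode = 0;
          break;
        }
      }
      holder.asciiMode = mode; // cache the answer on the shared holder
    }
    return holder.asciiMode == 1;
  }
}
```

Because both LIKE evaluations in the example query see the same holder instance, the second call would return the cached answer without touching the bytes again.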



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
