[ 
https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210355#comment-16210355
 ] 

ASF GitHub Bot commented on DRILL-5879:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1001
  
    @sachouche, thanks for the first PR to Drill! Thanks for the detailed 
explanation!
    
    Before reviewing the code, a comment on the design:
    
    > Added a new integer variable "asciiMode" ... this value will be set ... 
during the first LIKE evaluation and will be reused across other LIKE 
evaluations
    
    The problem with this design is that there is no guarantee that the first 
value is representative of the other values in the column. Suppose the values 
look like this:
    
    ```
    Hello
    你好
    ```
    
    The first value is ASCII. The second is not. So, we must treat each value 
as independent of the others.
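
A minimal sketch of why a flag cached from the first value is unsafe. The class 
`AsciiCheck` and its `isAscii` helper are hypothetical names used only for 
illustration; they are not Drill code, and the real `asciiMode` logic in the PR 
may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Drill code.
final class AsciiCheck {

    // A UTF-8 value is pure ASCII only if no byte has its high bit set.
    // Java bytes are signed, so a set high bit shows up as a negative value.
    static boolean isAscii(byte[] utf8) {
        for (byte b : utf8) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" is the second value from the example above.
        byte[] first = "Hello".getBytes(StandardCharsets.UTF_8);
        byte[] second = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);

        // Assumed scenario: the flag is computed once and then reused.
        boolean cachedFromFirst = isAscii(first);

        System.out.println(cachedFromFirst);   // true
        System.out.println(isAscii(second));   // false: the cached flag is wrong here
    }
}
```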
    
    On the other hand, we *can* exploit the nature of UTF-8. The encoding is 
such that no valid UTF-8 character is a prefix of any other valid character. 
Thus, if a character is 0xXX 0xYY 0xZZ, then there can *never* be a valid 
character which is 0xXX 0xYY. As a result, starts-with, ends-with, equals and 
contains can be done without either converting to UTF-16 or even caring if the 
data is ASCII or not.
    
    What does this mean? It means that, for the simple operations:
    
    1. Convert the Java UTF-16 pattern string to UTF-8.
    2. Do the classic byte comparison methods for starts with, ends with or 
contains. No special processing is needed for multi-byte characters.
    
    Unlike other multi-byte encodings, UTF-8 was designed to make this possible.
    
    If we go this route, we would not need the ASCII mode flag.
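
The two steps above can be sketched roughly as follows. This is a sketch under 
stated assumptions, not the code in the PR: the class `Utf8Like` and all of its 
methods are hypothetical, and the value is shown as a plain `byte[]` rather than 
Drill's own buffer types.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Drill: byte-level matching on UTF-8 data.
final class Utf8Like {

    // Step 1: encode the Java (UTF-16) pattern as UTF-8 once, up front.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if `pattern` occurs in `value` starting at byte offset `at`.
    static boolean regionMatches(byte[] value, int at, byte[] pattern) {
        if (at < 0 || at + pattern.length > value.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (value[at + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Step 2: plain byte comparisons, with no ASCII check and no decoding.
    static boolean startsWith(byte[] value, byte[] pattern) {
        return regionMatches(value, 0, pattern);
    }

    static boolean endsWith(byte[] value, byte[] pattern) {
        return regionMatches(value, value.length - pattern.length, pattern);
    }

    static boolean contains(byte[] value, byte[] pattern) {
        for (int at = 0; at + pattern.length <= value.length; at++) {
            if (regionMatches(value, at, pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" is the non-ASCII value from the example above.
        byte[] value = utf8("Hello \u4f60\u597d");
        System.out.println(startsWith(value, utf8("Hello")));  // true
        System.out.println(endsWith(value, utf8("\u597d")));   // true
        System.out.println(contains(value, utf8("\u4f60")));   // true
        System.out.println(contains(value, utf8("xyz")));      // false
    }
}
```

The same loops handle ASCII and multi-byte values alike, so no per-value or 
per-column mode flag is consulted.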
    
    Note: all of this applies only to the "basic four" operations: if we do a 
real regex, then we must decode the Varchar into a Java UTF-16 string.
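
For contrast, a rough sketch of the regex path, where decoding is needed 
because `java.util.regex` works on UTF-16 characters rather than bytes. The 
class `RegexFallback` and its `matches` method are hypothetical names, and how 
Drill actually translates a LIKE pattern into a regex is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical illustration, not Drill code.
final class RegexFallback {

    // Decode the UTF-8 value to a Java String, then run the regex on it.
    static boolean matches(byte[] utf8Value, Pattern pattern) {
        String decoded = new String(utf8Value, StandardCharsets.UTF_8);
        return pattern.matcher(decoded).matches();
    }

    public static void main(String[] args) {
        // ".." means "exactly two characters". The value below is two
        // characters but six bytes in UTF-8, so the match must run on
        // decoded characters, not on raw bytes.
        Pattern twoChars = Pattern.compile("..");
        byte[] value = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);

        System.out.println(value.length);              // 6
        System.out.println(matches(value, twoChars));  // true
    }
}
```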


> Optimize "Like" operator
> ------------------------
>
>                 Key: DRILL-5879
>                 URL: https://issues.apache.org/jira/browse/DRILL-5879
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>         Environment: * 
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like 
> '%xyz%';
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're 
> dealing with the same column twice
> # Optimize the "contains" for-loop 
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being 
> in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all 
> evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and 
> will be reused across other LIKE evaluations
> 2) 
> * The "Contains" LIKE operation is quite expensive as the code needs to 
> access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to 
> optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment 
> not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes    : 27sec



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
