[ 
https://issues.apache.org/jira/browse/DRILL-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317572#comment-15317572
 ] 

ASF GitHub Bot commented on DRILL-4573:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/512#discussion_r65994494
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/CharSequenceWrapper.java
 ---
    @@ -17,13 +17,52 @@
      */
     package org.apache.drill.exec.expr.fn.impl;
     
    +import java.nio.ByteBuffer;
    +import java.nio.CharBuffer;
    +import java.nio.charset.CharacterCodingException;
    +import java.nio.charset.Charset;
    +import java.nio.charset.CharsetDecoder;
    +import java.nio.charset.CoderResult;
    +import java.util.regex.Matcher;
    +
     import io.netty.buffer.DrillBuf;
     
    +/**
    + * A CharSequence is a readable sequence of char values. This interface 
provides
    + * uniform, read-only access to many different kinds of char sequences. A 
char
    + * value represents a character in the Basic Multilingual Plane (BMP) or a
    + * surrogate. Refer to Unicode Character Representation for details.<br>
    + * Specifically this implementation of the CharSequence adapts a Drill
    + * {@link DrillBuf} to the CharSequence. The implementation is meant to be
    + * re-used that is allocated once and then passed DrillBuf to adapt. This 
can be
    + * handy to exploit API that consume CharSequence avoiding the need to 
create
    + * string objects.
    + *
    + */
     public class CharSequenceWrapper implements CharSequence {
     
    +    // The adapted drill buffer (in the case of US-ASCII)
    +    private DrillBuf buffer;
    +    // The converted bytes in the case of non ASCII
    +    private CharBuffer charBuffer;
    --- End diff --
    
    In the case of non ASCII, did we see any improvement compared to the 
approach before DRILL-4573? If there is no noticeable improvement,  can we 
switch to the original call?
    
    String i = 
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(input.start,
 input.end, input.buffer);
    
    Seems to me this CharBuffer is doing exactly the same job as this function 
: encode the byte arrays in UTF8. Having to init / re-allocate makes thing 
complicated; it could introduce bugs here and there. The original method calls 
java library method, which is assumed to be well tested.
    
    Can we 1) use CharSequenceWrapper only for ASCII, 2) construct String for 
non-ASCII in the body of regex function (regex_matches)?
     


> Zero copy LIKE, REGEXP_MATCHES, SUBSTR
> --------------------------------------
>
>                 Key: DRILL-4573
>                 URL: https://issues.apache.org/jira/browse/DRILL-4573
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: jean-claude
>            Priority: Critical
>             Fix For: 1.7.0
>
>         Attachments: DRILL-4573-3.patch.txt, DRILL-4573.patch.txt
>
>
> All the functions using the java.util.regex.Matcher are currently creating 
> Java string objects to pass into the matcher.reset().
> However this creates unnecessary copy of the bytes and a Java string object.
> The matcher uses a CharSequence, so instead of making a copy we can create an 
> adapter from the DrillBuffer to the CharSequence interface.
> Gains of 25% in execution speed are possible when going over VARCHAR of 36 
> chars. The gain will be proportional to the size of the VARCHAR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to