[ 
https://issues.apache.org/jira/browse/NIFI-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882759#comment-15882759
 ] 

ASF GitHub Bot commented on NIFI-2876:
--------------------------------------

Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/1214#discussion_r102947724
  
    --- Diff: 
nifi-commons/nifi-utils/src/main/java/org/apache/nifi/stream/io/util/TextLineDemarcator.java
 ---
    @@ -90,58 +70,73 @@ public OffsetInfo nextOffsetInfo() {
          * by either '\r', '\n' or '\r\n'). <br>
          * The <i>offset info</i> computed and returned as {@link OffsetInfo} 
where
          * {@link OffsetInfo#isStartsWithMatch()} will return true if
    -     * <code>startsWith</code> was successfully matched with the stsarting 
bytes
    +     * <code>startsWith</code> was successfully matched with the starting 
bytes
          * of the text line.
          *
    +     * NOTE: The reason for 2 'nextOffsetInfo(..)' operations is that the
    +     * 'startsWith' argument will force the actual token to be extracted 
and
    +     * then matched introducing the overhead for System.arrayCopy and 
matching
    +     * logic which is an optional scenario and is avoided all together if
    +     * 'startsWith' is not provided (i.e., null).
    +     *
          * @return offset info
          */
    -    public OffsetInfo nextOffsetInfo(byte[] startsWith) {
    +    public OffsetInfo nextOffsetInfo(byte[] startsWith) throws IOException 
{
             OffsetInfo offsetInfo = null;
    -        int lineLength = 0;
    -        byte[] token = null;
    -        lineLoop:
    -        while (this.bufferLength != -1) {
    -            if (this.index >= this.bufferLength) {
    +        byte previousByteVal = 0;
    +        byte[] data = null;
    +        nextTokenLoop:
    +        while (data == null && this.availableBytesLength != -1) {
    +            if (this.index >= this.availableBytesLength) {
                     this.fill();
                 }
    -            if (this.bufferLength != -1) {
    -                int i;
    +            int delimiterSize = 0;
    +            if (this.availableBytesLength != -1) {
                     byte byteVal;
    -                for (i = this.index; i < this.bufferLength; i++) {
    +                int i;
    +                for (i = this.index; i < this.availableBytesLength; i++) {
                         byteVal = this.buffer[i];
    -                    lineLength++;
    -                    int crlfLength = computeEol(byteVal, i + 1);
    -                    if (crlfLength > 0) {
    -                        i += crlfLength;
    -                        if (crlfLength == 2) {
    -                            lineLength++;
    -                        }
    -                        offsetInfo = new OffsetInfo(this.offset, 
lineLength, crlfLength);
    +
    +                    if (byteVal == LF) {
    +                        delimiterSize = previousByteVal == CR ? 2 : 1;
    +                    } else if (previousByteVal == CR) {
    +                        delimiterSize = 1;
    +                        i--;
    +                    }
    +                    previousByteVal = byteVal;
    +                    if (delimiterSize > 0) {
    +                        this.index = i + 1;
    +                        int size = Math.max(1, this.index - this.mark);
    +                        offsetInfo = new OffsetInfo(this.offset, size, 
delimiterSize);
    +                        this.offset += size;
                             if (startsWith != null) {
    -                            token = this.extractDataToken(lineLength);
    +                            data = this.extractDataToken(size);
                             }
                             this.mark = this.index;
    -                        break lineLoop;
    +                        break nextTokenLoop;
                         }
                     }
                     this.index = i;
    +            } else {
    +                delimiterSize = previousByteVal == CR || previousByteVal 
== LF ? 1 : 0;
    +                if (offsetInfo == null) {
    +                    int size = this.index - this.mark;
    +                    if (size > 0) {
    +                        offsetInfo = new OffsetInfo(this.offset, size, 
delimiterSize);
    +                        this.offset += size;
    +                    }
    +                }
    +                if (startsWith != null) {
    +                    data = this.extractDataToken(this.index - this.mark);
    +                }
                 }
             }
    -        // EOF where last char(s) are not CRLF.
    -        if (lineLength > 0 && offsetInfo == null) {
    -            offsetInfo = new OffsetInfo(this.offset, lineLength, 0);
    -            if (startsWith != null) {
    -                token = this.extractDataToken(lineLength);
    -            }
    -        }
    -        this.offset += lineLength;
     
    -        // checks if the new line starts with 'startsWith' chars
    -        if (startsWith != null) {
    +        if (startsWith != null && data != null) {
                 for (int i = 0; i < startsWith.length; i++) {
                     byte sB = startsWith[i];
    -                if (token != null && sB != token[i]) {
    -                    offsetInfo.setStartsWithMatch(0);
    +                if (sB != data[i]) {
    --- End diff --
    
    There is a potential `ArrayIndexOutOfBoundsException` here. If `startsWith` 
is longer than the last token that is extracted and the token does match the 
first `data.length` bytes of the `startsWith` array, then it will throw an 
`ArrayIndexOutOfBoundsException`. The following unit test will expose that:
    
    
    ```
    @Test
        public void testStartsWithLongerThanLastToken() throws IOException {
            final byte[] inputData = "This is going to be a spectacular 
test\nThis is".getBytes(StandardCharsets.UTF_8);
            final byte[] startsWith = "This is going to 
be".getBytes(StandardCharsets.UTF_8);
    
            try (final InputStream is = new ByteArrayInputStream(inputData);
                final TextLineDemarcator demarcator = new 
TextLineDemarcator(is)) {
    
                final OffsetInfo first = demarcator.nextOffsetInfo(startsWith);
                assertNotNull(first);
                assertEquals(0, first.getStartOffset());
                assertEquals(39, first.getLength());
                assertEquals(1, first.getCrlfLength());
                assertTrue(first.isStartsWithMatch());
    
                final OffsetInfo second = demarcator.nextOffsetInfo(startsWith);
                assertNotNull(second);
                assertEquals(39, second.getStartOffset());
                assertEquals(7, second.getLength());
                assertEquals(0, second.getCrlfLength());
                assertFalse(second.isStartsWithMatch());
            }
        }
    
    ```


> Refactor TextLineDemarcator and StreamDemarcator into a common abstract class
> -----------------------------------------------------------------------------
>
>                 Key: NIFI-2876
>                 URL: https://issues.apache.org/jira/browse/NIFI-2876
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Oleg Zhurakousky
>            Assignee: Oleg Zhurakousky
>            Priority: Minor
>             Fix For: 1.2.0
>
>
> Based on the work that has been performed as part of the NIFI-2851 we now 
> have a new class with a significantly faster logic to perform demarcation of 
> the InputStream (TextLineDemarcator). This new class's initial starting point 
> was the existing LineDemarcator. They both now share ~60-70% of common code 
> which would be important to extract into a common abstract class as well as 
> incorporate the new (faster) demarcation logic int StreamDemarcator.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to