[
https://issues.apache.org/jira/browse/NIFI-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882759#comment-15882759
]
ASF GitHub Bot commented on NIFI-2876:
--------------------------------------
Github user markap14 commented on a diff in the pull request:
https://github.com/apache/nifi/pull/1214#discussion_r102947724
--- Diff:
nifi-commons/nifi-utils/src/main/java/org/apache/nifi/stream/io/util/TextLineDemarcator.java
---
@@ -90,58 +70,73 @@ public OffsetInfo nextOffsetInfo() {
* by either '\r', '\n' or '\r\n'). <br>
* The <i>offset info</i> computed and returned as {@link OffsetInfo}
where
* {@link OffsetInfo#isStartsWithMatch()} will return true if
- * <code>startsWith</code> was successfully matched with the stsarting
bytes
+ * <code>startsWith</code> was successfully matched with the starting
bytes
* of the text line.
*
+ * NOTE: The reason for 2 'nextOffsetInfo(..)' operations is that the
+ * 'startsWith' argument will force the actual token to be extracted
and
+ * then matched introducing the overhead for System.arrayCopy and
matching
+ * logic which is an optional scenario and is avoided all together if
+ * 'startsWith' is not provided (i.e., null).
+ *
* @return offset info
*/
- public OffsetInfo nextOffsetInfo(byte[] startsWith) {
+ public OffsetInfo nextOffsetInfo(byte[] startsWith) throws IOException
{
OffsetInfo offsetInfo = null;
- int lineLength = 0;
- byte[] token = null;
- lineLoop:
- while (this.bufferLength != -1) {
- if (this.index >= this.bufferLength) {
+ byte previousByteVal = 0;
+ byte[] data = null;
+ nextTokenLoop:
+ while (data == null && this.availableBytesLength != -1) {
+ if (this.index >= this.availableBytesLength) {
this.fill();
}
- if (this.bufferLength != -1) {
- int i;
+ int delimiterSize = 0;
+ if (this.availableBytesLength != -1) {
byte byteVal;
- for (i = this.index; i < this.bufferLength; i++) {
+ int i;
+ for (i = this.index; i < this.availableBytesLength; i++) {
byteVal = this.buffer[i];
- lineLength++;
- int crlfLength = computeEol(byteVal, i + 1);
- if (crlfLength > 0) {
- i += crlfLength;
- if (crlfLength == 2) {
- lineLength++;
- }
- offsetInfo = new OffsetInfo(this.offset,
lineLength, crlfLength);
+
+ if (byteVal == LF) {
+ delimiterSize = previousByteVal == CR ? 2 : 1;
+ } else if (previousByteVal == CR) {
+ delimiterSize = 1;
+ i--;
+ }
+ previousByteVal = byteVal;
+ if (delimiterSize > 0) {
+ this.index = i + 1;
+ int size = Math.max(1, this.index - this.mark);
+ offsetInfo = new OffsetInfo(this.offset, size,
delimiterSize);
+ this.offset += size;
if (startsWith != null) {
- token = this.extractDataToken(lineLength);
+ data = this.extractDataToken(size);
}
this.mark = this.index;
- break lineLoop;
+ break nextTokenLoop;
}
}
this.index = i;
+ } else {
+ delimiterSize = previousByteVal == CR || previousByteVal
== LF ? 1 : 0;
+ if (offsetInfo == null) {
+ int size = this.index - this.mark;
+ if (size > 0) {
+ offsetInfo = new OffsetInfo(this.offset, size,
delimiterSize);
+ this.offset += size;
+ }
+ }
+ if (startsWith != null) {
+ data = this.extractDataToken(this.index - this.mark);
+ }
}
}
- // EOF where last char(s) are not CRLF.
- if (lineLength > 0 && offsetInfo == null) {
- offsetInfo = new OffsetInfo(this.offset, lineLength, 0);
- if (startsWith != null) {
- token = this.extractDataToken(lineLength);
- }
- }
- this.offset += lineLength;
- // checks if the new line starts with 'startsWith' chars
- if (startsWith != null) {
+ if (startsWith != null && data != null) {
for (int i = 0; i < startsWith.length; i++) {
byte sB = startsWith[i];
- if (token != null && sB != token[i]) {
- offsetInfo.setStartsWithMatch(0);
+ if (sB != data[i]) {
--- End diff --
There is a potential `ArrayIndexOutOfBoundsException` here. If `startsWith`
is longer than the last token that is extracted and the token does match the
first `data.length` bytes of the `startsWith` array, then it will throw an
`ArrayIndexOutOfBoundsException`. The following unit test will expose that:
```
@Test
public void testStartsWithLongerThanLastToken() throws IOException {
final byte[] inputData = "This is going to be a spectacular
test\nThis is".getBytes(StandardCharsets.UTF_8);
final byte[] startsWith = "This is going to
be".getBytes(StandardCharsets.UTF_8);
try (final InputStream is = new ByteArrayInputStream(inputData);
final TextLineDemarcator demarcator = new
TextLineDemarcator(is)) {
final OffsetInfo first = demarcator.nextOffsetInfo(startsWith);
assertNotNull(first);
assertEquals(0, first.getStartOffset());
assertEquals(39, first.getLength());
assertEquals(1, first.getCrlfLength());
assertTrue(first.isStartsWithMatch());
final OffsetInfo second = demarcator.nextOffsetInfo(startsWith);
assertNotNull(second);
assertEquals(39, second.getStartOffset());
assertEquals(7, second.getLength());
assertEquals(0, second.getCrlfLength());
assertFalse(second.isStartsWithMatch());
}
}
```
> Refactor TextLineDemarcator and StreamDemarcator into a common abstract class
> -----------------------------------------------------------------------------
>
> Key: NIFI-2876
> URL: https://issues.apache.org/jira/browse/NIFI-2876
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Oleg Zhurakousky
> Assignee: Oleg Zhurakousky
> Priority: Minor
> Fix For: 1.2.0
>
>
> Based on the work that has been performed as part of the NIFI-2851 we now
> have a new class with a significantly faster logic to perform demarcation of
> the InputStream (TextLineDemarcator). This new class's initial starting point
> was the existing LineDemarcator. They both now share ~60-70% of common code
> which would be important to extract into a common abstract class as well as
> incorporate the new (faster) demarcation logic int StreamDemarcator.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)