[
https://issues.apache.org/jira/browse/NIFI-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119319#comment-15119319
]
ASF GitHub Bot commented on NIFI-1118:
--------------------------------------
Github user markobean commented on a diff in the pull request:
https://github.com/apache/nifi/pull/135#discussion_r50990975
--- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java ---
@@ -143,26 +165,12 @@ protected void init(final ProcessorInitializationContext context) {
         return properties;
     }
-    private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException {
-        int numLines = 0;
-        for (int i = 0; i < maxNumLines; i++) {
-            final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1));
-            if (bytes <= 0) {
-                return numLines;
-            }
-
-            numLines++;
-        }
-
-        return numLines;
-    }
-
-    private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final boolean includeLineDelimiter) throws IOException {
+    private int readLine(final InputStream in, final OutputStream out,
+                         final boolean includeLineDelimiter) throws IOException {
         int lastByte = -1;
-        long bytesRead = 0L;
+        int bytesRead = 0;
         while (true) {
-            in.mark(1);
--- End diff --
This "in.mark(1)" was removed because the way the input stream is marked and
reset has changed. Previously, mark/reset was used to roll back the read of
the character following a \r. (More below.) Now, mark/reset is used at a
higher level to potentially roll back an entire line read from the input
flowfile if adding that line would exceed the size limit imposed by the
FRAGMENT_MAX_SIZE property.
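The line-level rollback described above can be sketched roughly as follows.
This is a minimal illustration, not the code in the pull request: the class
name, the split() helper, and the maxFragmentSize parameter (standing in for
FRAGMENT_MAX_SIZE) are all hypothetical, and the sketch assumes a stream that
supports mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of line-level mark/reset; not the SplitText source.
final class LineRollbackSketch {

    // Copies one line, delimiter included, from in to out.
    // Returns the number of bytes consumed; 0 means end of stream.
    static int readLine(InputStream in, OutputStream out) throws IOException {
        int bytesRead = 0;
        int nextByte;
        while ((nextByte = in.read()) != -1) {
            out.write(nextByte);
            bytesRead++;
            if (nextByte == '\n') {
                break;
            }
        }
        return bytesRead;
    }

    // Splits data into fragments of at most maxFragmentSize bytes
    // (a single line larger than the limit still gets its own fragment).
    static List<String> split(byte[] data, int maxFragmentSize) {
        List<String> fragments = new ArrayList<>();
        ByteArrayOutputStream fragment = new ByteArrayOutputStream();
        InputStream in = new ByteArrayInputStream(data); // supports mark/reset
        try {
            while (true) {
                in.mark(Integer.MAX_VALUE); // remember where this line starts
                ByteArrayOutputStream line = new ByteArrayOutputStream();
                int lineSize = readLine(in, line);
                if (lineSize == 0) {
                    break; // end of stream
                }
                if (fragment.size() > 0 && fragment.size() + lineSize > maxFragmentSize) {
                    in.reset(); // roll back the entire line
                    fragments.add(new String(fragment.toByteArray(), StandardCharsets.UTF_8));
                    fragment.reset(); // start a new fragment and re-read the line
                    continue;
                }
                line.writeTo(fragment);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (fragment.size() > 0) {
            fragments.add(new String(fragment.toByteArray(), StandardCharsets.UTF_8));
        }
        return fragments;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(split(data, 7).size()); // prints 2
    }
}
```

The line is buffered here only to keep the sketch short; the point is that
the mark is set once per line, outside readLine(), so the reset always
rewinds to a line boundary.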
Removing this in.mark() does not cause the in.reset() below (line 206) to
throw an IOException, because in.mark() has already been called before
readLine() is invoked (line 239 or line 323). Nonetheless, the overall logic
is still incorrect. The stream is now reset under two different conditions,
one character-based and one line-based. This must be corrected and made
consistent (line-based).
The special consideration of the \r character confuses me (the 'if' block
beginning on line 205). In the original code, the byte after the \r is rolled
back. Why? If the byte after \r is \n, the special consideration for \r is not
required, as the 'if' on line 194 captures the end of line. Is it valid and
intended for \r to indicate an end of line even when a subsequent \n is not
present? (Note: testing of the present SplitText processor shows incorrect
behavior for \r without \n; it duplicates the first character of the next
"line".)
If special consideration for \r is not required and the 'if' block
beginning on line 205 is removed, both the in.mark and in.reset within the
readLine() method go away, and I believe all will be correct.
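As a rough sketch of what a readLine() without the \r special case might look
like, assuming \n is the only line terminator and a lone \r is ordinary
content: the class name, the readAll() helper, and the pendingCr flag are
hypothetical and not taken from the pull request. The flag is one possible way
to drop the \r of a \r\n pair when the delimiter is excluded, without calling
mark or reset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: '\n' alone ends a line, so no mark/reset is needed here.
final class NewlineOnlyReaderSketch {

    // Copies one line from in to out and returns the number of bytes consumed
    // (0 at end of stream). A '\r' is held back until the next byte is known,
    // so that the "\r\n" pair can be omitted when includeLineDelimiter is false.
    static int readLine(InputStream in, OutputStream out,
                        boolean includeLineDelimiter) throws IOException {
        int bytesRead = 0;
        boolean pendingCr = false;
        while (true) {
            int nextByte = in.read();
            if (nextByte == -1) {
                if (pendingCr) {
                    out.write('\r'); // trailing lone '\r' is content
                }
                return bytesRead;
            }
            bytesRead++;
            if (nextByte == '\n') {
                if (includeLineDelimiter) {
                    if (pendingCr) {
                        out.write('\r');
                    }
                    out.write('\n');
                }
                return bytesRead;
            }
            if (pendingCr) {
                out.write('\r'); // the held '\r' was not followed by '\n'
            }
            pendingCr = (nextByte == '\r');
            if (!pendingCr) {
                out.write(nextByte);
            }
        }
    }

    // Reads every line of text into a list, for demonstration only.
    static List<String> readAll(String text, boolean includeLineDelimiter) {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        List<String> lines = new ArrayList<>();
        try {
            while (true) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                if (readLine(in, out, includeLineDelimiter) == 0) {
                    break;
                }
                lines.add(new String(out.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readAll("ab\rcd\r\nef", false).size()); // prints 2
    }
}
```

With this approach the input "ab\rcd\r\nef" yields two lines, and the lone \r
stays inside the first one, so no byte is read twice. Whether a lone \r should
instead end a line is the open question raised above.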
> Enable SplitText processor to limit line length and filter header lines
> -----------------------------------------------------------------------
>
> Key: NIFI-1118
> URL: https://issues.apache.org/jira/browse/NIFI-1118
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Bean
> Assignee: Joe Skora
> Fix For: 0.6.0
>
>
> Add the following functionality to the SplitText processor:
> 1) Maximum size limit of the split file(s)
> A new split file will be created if the next line to be added to the current
> split file exceeds a user-defined maximum file size
> 2) Header line marker
> User-defined character(s) can be used to identify the header line(s) of the
> data file rather than a predetermined number of lines
> These changes are additions, not a replacement of any property or behavior.
> In the case of header line marker, the existing property "Header Line Count"
> must be zero for the new property and behavior to be used.