[ 
https://issues.apache.org/jira/browse/NIFI-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187800#comment-15187800
 ] 

Mark Bean commented on NIFI-1118:
---------------------------------

What is the intent of the 'Remove Trailing Newlines' property? I believe the 
intent is to remove the End Of Line (EOL) character from the last line of each 
split file, along with any additional lines that consist of nothing other than 
the EOL character (i.e., blank lines). It works as expected when the split 
contains data other than blank lines. However, blank lines result in odd 
behavior. For example, with Header Line Count = 0, Line Split Count = 3, 
Remove Trailing Newlines = true, and an input file whose lines 4-9 consist of 
only '\n', I have observed the second split file containing only 2 (blank) 
lines. Essentially, only the last line of the split has its EOL removed.

Even more concerning is the case when Header Line Count is specified (and 
therefore all lines are written to an output stream rather than simply cloning 
segments of the input flowfile). Here, when a split file consists of nothing 
but blank lines, not only is that split file not output, but no subsequent 
split files are generated. Splitting effectively stops because the processor 
treats the empty split file as the result of End Of File. This is a bug.
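
The failure mode above can be modeled with a simplified sketch. It is written 
in Python rather than the processor's Java, and it is not the actual SplitText 
code: read_split, split_buggy and split_fixed are hypothetical names, and EOL 
is assumed to be '\n'. The buggy loop stops as soon as a split comes out empty 
after trimming; the fixed loop stops only when no lines were read.

```python
def read_split(lines, start, count):
    """Return (raw_lines, trimmed_content) for up to `count` lines from `start`."""
    raw = lines[start:start + count]
    # Model of Remove Trailing Newlines = true: drop trailing EOL-only content.
    trimmed = "".join(raw).rstrip("\n")
    return raw, trimmed


def split_buggy(lines, count):
    splits, pos = [], 0
    while True:
        raw, trimmed = read_split(lines, pos, count)
        if not trimmed:      # empty split is mistaken for End Of File
            break
        splits.append(trimmed)
        pos += len(raw)
    return splits


def split_fixed(lines, count):
    splits, pos = [], 0
    while True:
        raw, trimmed = read_split(lines, pos, count)
        if not raw:          # nothing was read: this really is End Of File
            break
        if trimmed:          # skip an all-blank split, but keep going
            splits.append(trimmed)
        pos += len(raw)
    return splits


lines = ["a\n", "b\n", "c\n", "\n", "\n", "\n", "d\n", "e\n"]
print(split_buggy(lines, 3))  # the split holding "d" and "e" is never produced
print(split_fixed(lines, 3))
```

Whether an all-blank split should be skipped, as split_fixed does here, or 
emitted in some form is exactly the behavior that still needs to be defined.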

This can be addressed in the redesign of the SplitText processor. However, 
"proper" behavior needs to be well-defined. Additionally, I strongly recommend 
that the last line of the split file contain exactly the same contents as the 
corresponding line from the original flowfile. In other words, keep the EOL 
character. Removing it becomes highly problematic when splitting on maximum 
size. In such cases, you never know you're on the last line of the split file 
until the next line is read (and found to exceed the limit). Further, the 
behavior of a split file consisting of only blank lines (when Remove Trailing 
Newlines is true) needs to be clearly defined.
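
To illustrate the maximum-size point, here is a minimal Python sketch, not the 
processor's implementation. The name split_by_max_size and the choice to place 
an oversized line in a split of its own are assumptions. The split boundary is 
discovered only once the next line is in hand, so keeping every EOL avoids 
having to go back and edit a line that has already been written.

```python
def split_by_max_size(lines, max_bytes):
    """Group lines into splits of at most `max_bytes`, keeping every EOL."""
    splits, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Only now, with the next line in hand, do we learn that the
        # previous line was the last one of the current split.
        if current and size + line_size > max_bytes:
            splits.append("".join(current))
            current, size = [], 0
        # Assumption: a single line larger than max_bytes gets its own split.
        current.append(line)
        size += line_size
    if current:
        splits.append("".join(current))
    return splits


print(split_by_max_size(["aaaa\n", "bb\n", "cccc\n", "d\n"], 8))
```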

Suggestions: keep the EOL on every line, and remove only trailing blank lines. 
Further, in cases where Remove Trailing Newlines is true and a split consists 
of only newlines, the split should consist of a single blank line.
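
The suggested rule could look like the following Python sketch. It is only an 
illustration of the proposal, under the assumptions that EOL is '\n' (no 
'\r\n' handling) and that a split arrives as a list of lines; trim_split is a 
hypothetical name.

```python
def trim_split(lines, remove_trailing_newlines=True):
    """Apply the suggested rule to one split, given as lines with their EOLs."""
    if not remove_trailing_newlines:
        return "".join(lines)
    kept = list(lines)
    # Drop trailing lines that contain nothing but the EOL character.
    while kept and kept[-1] == "\n":
        kept.pop()
    if not kept and lines:
        # The split held only blank lines: emit a single blank line.
        return "\n"
    # The last remaining line keeps its EOL, and interior blank lines stay.
    return "".join(kept)


print(repr(trim_split(["a\n", "\n", "b\n", "\n", "\n"])))
print(repr(trim_split(["\n", "\n", "\n"])))
```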

Please comment.

> Enable SplitText processor to limit line length and filter header lines
> -----------------------------------------------------------------------
>
>                 Key: NIFI-1118
>                 URL: https://issues.apache.org/jira/browse/NIFI-1118
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Bean
>            Assignee: Joe Skora
>             Fix For: 0.6.0
>
>
> Add the following functionality to the SplitText processor:
> 1) Maximum size limit of the split file(s)
> A new split file will be created if the next line to be added to the current 
> split file exceeds a user-defined maximum file size
> 2) Header line marker
> User-defined character(s) can be used to identify the header line(s) of the 
> data file rather than a predetermined number of lines
> These changes are additions, not a replacement of any property or behavior. 
> In the case of header line marker, the existing property "Header Line Count" 
> must be zero for the new property and behavior to be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
