[jira] [Commented] (NIFI-1118) Enable SplitText processor to limit line length and filter header lines

ASF GitHub Bot (JIRA) Wed, 27 Jan 2016 06:38:49 -0800

    [ https://issues.apache.org/jira/browse/NIFI-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119319#comment-15119319 ]


ASF GitHub Bot commented on NIFI-1118:
--------------------------------------

Github user markobean commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/135#discussion_r50990975
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java ---
    @@ -143,26 +165,12 @@ protected void init(final ProcessorInitializationContext context) {
             return properties;
         }
     
    -    private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException {
    -        int numLines = 0;
    -        for (int i = 0; i < maxNumLines; i++) {
    -            final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1));
    -            if (bytes <= 0) {
    -                return numLines;
    -            }
    -
    -            numLines++;
    -        }
    -
    -        return numLines;
    -    }
    -
    -    private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final boolean includeLineDelimiter) throws IOException {
    +    private int readLine(final InputStream in, final OutputStream out,
    +                          final boolean includeLineDelimiter) throws IOException {
             int lastByte = -1;
    -        long bytesRead = 0L;
    +        int bytesRead = 0;
     
             while (true) {
    -            in.mark(1);
    --- End diff --
    
    This "in.mark(1)" was removed because the way the input stream is marked 
and reset has changed. Previously, mark/reset was used to roll back the read 
of the single byte following a \r (more on this below). Now, mark/reset is 
used at a higher level to roll back an entire line read from the input 
flowfile if that line exceeds the size limit imposed by the FRAGMENT_MAX_SIZE 
property.
    Removing this in.mark() does not cause the in.reset() below (line 206) to 
throw an IOException, because in.mark() has already been called before the 
call to readLine() (line 239 or line 323). Nonetheless, the overall logic is 
still incorrect: there are two logically distinct in.reset() conditions, 
character-based and line-based, and the character-based reset now returns to 
the line-level mark. This must be corrected and made consistent (line-based).
    The special handling of the \r character confuses me (the 'if' block 
beginning on line 205). In the original code, the byte after the \r is rolled 
back. Why? If the byte after \r is \n, no special handling of \r is required, 
because the 'if' on line 194 already catches the end of line. Is it valid and 
intended for a \r to indicate an end of line even when no \n follows? (Note: 
testing the present SplitText processor shows incorrect behavior for a \r 
without a \n; it duplicates the first character of the next "line".)
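
    For reference, if a lone \r is meant to end a line, a character-level 
peek could look like the sketch below. It is an assumption-laden illustration 
rather than the SplitText code: the class and helper names are hypothetical, 
and the stream must support mark/reset. The byte after \r is read once; it is 
consumed only if it is \n, and otherwise it is un-read and never written, so 
it cannot appear twice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the SplitText implementation.
class CrAwareLineReader {

    // Copies one line from in to out, treating \n, \r\n and a lone \r as line ends.
    // Returns the number of bytes consumed from in (0 at end of stream).
    static int readLine(InputStream in, OutputStream out, boolean includeLineDelimiter)
            throws IOException {
        int bytesRead = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                return bytesRead; // end of stream, possibly without a delimiter
            }
            bytesRead++;
            if (b == '\n') {
                if (includeLineDelimiter) {
                    out.write(b);
                }
                return bytesRead;
            }
            if (b == '\r') {
                if (includeLineDelimiter) {
                    out.write(b);
                }
                in.mark(1); // remember the position just after \r
                int next = in.read();
                if (next == '\n') {
                    bytesRead++; // \r\n: consume the \n as part of this line
                    if (includeLineDelimiter) {
                        out.write(next);
                    }
                } else if (next >= 0) {
                    in.reset(); // lone \r: the peeked byte starts the next line
                }
                return bytesRead;
            }
            out.write(b);
        }
    }

    // Convenience wrapper for in-memory text: returns each line with its delimiter.
    static List<String> splitLines(String text) {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1));
        List<String> lines = new ArrayList<>();
        try {
            while (true) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                if (readLine(in, out, true) == 0) {
                    return lines;
                }
                lines.add(new String(out.toByteArray(), StandardCharsets.ISO_8859_1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Note that this mark(1) would replace any mark taken by a caller before 
readLine(), which is exactly the conflict between character-based and 
line-based resets raised above.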
    If special handling of \r is not required and the 'if' block beginning on 
line 205 is removed, both the in.mark() and the in.reset() within the 
readLine() method go away, and I believe all will be correct.


> Enable SplitText processor to limit line length and filter header lines
> -----------------------------------------------------------------------
>
>                 Key: NIFI-1118
>                 URL: https://issues.apache.org/jira/browse/NIFI-1118
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Bean
>            Assignee: Joe Skora
>             Fix For: 0.6.0
>
>
> Add the following functionality to the SplitText processor:
> 1) Maximum size limit of the split file(s)
> A new split file will be created if the next line to be added to the current 
> split file exceeds a user-defined maximum file size
> 2) Header line marker
> User-defined character(s) can be used to identify the header line(s) of the 
> data file rather than a predetermined number of lines
> These changes are additions, not a replacement of any property or behavior. 
> In the case of header line marker, the existing property "Header Line Count" 
> must be zero for the new property and behavior to be used.
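
The header line marker in item 2 could be read as in the sketch below. This 
is one possible interpretation, not the processor's behavior: it assumes 
header lines are the leading, contiguous lines that start with the marker, 
and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the SplitText implementation.
class HeaderMarkerSketch {

    // Returns the leading lines that start with marker,
    // stopping at the first line that does not.
    static List<String> headerLines(List<String> lines, String marker) {
        List<String> header = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith(marker)) {
                break;
            }
            header.add(line);
        }
        return header;
    }
}
```

Under this reading, a marker line that appears after the first data line is 
treated as data, not as a header.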



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
