[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user joewitt commented on the pull request: https://github.com/apache/nifi/pull/280#issuecomment-211956655 thanks @markobean and @jskora . What do you think about making ignore newlines only be honored/supported when not using the new features you're planning to include or only in very specific configurations? I ask because this, admittedly mistaken, feature is used a lot. Ultimately if that seems to unwieldy we can punt that feature in 1.0, add your new capabilities, and support end of line removal on ReplaceText instead. We just need to remember to document this in the migration guide for 1.0 as this could cause some pretty funky behavior changes for folks. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markobean commented on the pull request: https://github.com/apache/nifi/pull/280#issuecomment-211915220 @joewitt Talked with @jskora and concur this PR can be closed. A new PR will be opened later with the new features and also with the Return Trailing Newlines bug fix included (if RTN is still included.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user joewitt commented on the pull request: https://github.com/apache/nifi/pull/280#issuecomment-211726456 @jskora do you think this PR can be closed now given the updates made to fix the underlying defects found? A new PR could be submitted which adds the proposed features or goes into ReplaceText or a new processor. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markap14 commented on a diff in the pull request: https://github.com/apache/nifi/pull/280#discussion_r56329501 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -143,72 +199,82 @@ protected void init(final ProcessorInitializationContext context) { return properties; } -private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException { +private int readLines(final InputStream in, final int maxNumLines, final long maxByteCount, final OutputStream out) throws IOException { int numLines = 0; +long totalBytes = 0L; for (int i = 0; i < maxNumLines; i++) { -final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1)); +final long bytes = countBytesToSplitPoint(in, out, totalBytes, maxByteCount); +totalBytes += bytes; if (bytes <= 0) { return numLines; } - numLines++; +if (totalBytes >= maxByteCount && numLines > maxNumLines) { +break; +} } - return numLines; } -private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final boolean includeLineDelimiter) throws IOException { -int lastByte = -1; +private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final long bytesReadSoFar, final long maxSize) throws IOException { long bytesRead = 0L; +final ByteArrayOutputStream buffer = new ByteArrayOutputStream(); --- End diff -- It looks like we always buffer this data into Java heap, but we don't actually use it for anything if out == null (which is the case more often than not, I believe). This is pretty concerning, as we should avoid buffering potentially large amounts of data into Java heap unless absolutely necessary. I think this needs to be reworked so that we don't buffer the data here, if out == null, since we won't make use of it anyway. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user jskora commented on a diff in the pull request: https://github.com/apache/nifi/pull/280#discussion_r56322564 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -134,6 +168,28 @@ protected void init(final ProcessorInitializationContext context) { } @Override +protected Collection customValidate(ValidationContext validationContext) { +ArrayList results = new ArrayList<>(); --- End diff -- I'm fixing these. Update will be pushed shortly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markap14 commented on a diff in the pull request: https://github.com/apache/nifi/pull/280#discussion_r56327650 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -143,72 +199,82 @@ protected void init(final ProcessorInitializationContext context) { return properties; } -private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException { +private int readLines(final InputStream in, final int maxNumLines, final long maxByteCount, final OutputStream out) throws IOException { int numLines = 0; +long totalBytes = 0L; for (int i = 0; i < maxNumLines; i++) { -final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1)); +final long bytes = countBytesToSplitPoint(in, out, totalBytes, maxByteCount); +totalBytes += bytes; if (bytes <= 0) { return numLines; } - numLines++; +if (totalBytes >= maxByteCount && numLines > maxNumLines) { --- End diff -- The logic here appears to be incorrect. numLines is incremented for each iteration of the loop (unless we return before it is incremented). This means that numLines <= i The loop's condition indicates i < maxNumLines So numLines <= i < maxNumLines So it is always the case that numLines < maxNumLines, so this condition will never be satisfied because numLines will never be > maxNumLines Now, looking through the code and doing a bit of testing, this does not appear to return an incorrect result, since countBytesToSplitPoint will handle the logic appropriately itself, but this should be fixed before it is merged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markobean commented on a diff in the pull request: https://github.com/apache/nifi/pull/280#discussion_r56328470 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -143,72 +199,82 @@ protected void init(final ProcessorInitializationContext context) { return properties; } -private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException { +private int readLines(final InputStream in, final int maxNumLines, final long maxByteCount, final OutputStream out) throws IOException { int numLines = 0; +long totalBytes = 0L; for (int i = 0; i < maxNumLines; i++) { -final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1)); +final long bytes = countBytesToSplitPoint(in, out, totalBytes, maxByteCount); +totalBytes += bytes; if (bytes <= 0) { return numLines; } - numLines++; +if (totalBytes >= maxByteCount && numLines > maxNumLines) { --- End diff -- Agreed. "&& numLines > maxNumLines" condition can be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user jskora commented on the pull request: https://github.com/apache/nifi/pull/280#issuecomment-197388774 I think the issues are all covered now and tests have been expanded to get all critical logic and edge cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markap14 commented on a diff in the pull request: https://github.com/apache/nifi/pull/280#discussion_r56234292 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -134,6 +168,28 @@ protected void init(final ProcessorInitializationContext context) { } @Override +protected Collection customValidate(ValidationContext validationContext) { +ArrayList results = new ArrayList<>(); + +results.add(new ValidationResult.Builder() +.subject("Remove Trailing Newlines") + .valid(!validationContext.getProperty(REMOVE_TRAILING_NEWLINES).asBoolean()) +.explanation("Property is deprecated; value must be set to false. Future releases may remove this Property.") +.build()); + +final boolean invalidState = (validationContext.getProperty(LINE_SPLIT_COUNT).asInteger() == 0 +&& validationContext.getProperty(FRAGMENT_MAX_SIZE).asDataSize(DataUnit.B) == null); --- End diff -- I think this line should be: && !validationContext.getProperty(FRAGMENT_MAX_SIZE).isSet() ); --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markap14 commented on a diff in the pull request: https://github.com/apache/nifi/pull/280#discussion_r56234376 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -134,6 +168,28 @@ protected void init(final ProcessorInitializationContext context) { } @Override +protected Collection customValidate(ValidationContext validationContext) { +ArrayList results = new ArrayList<>(); --- End diff -- Is there any reason we are declaring the type here as ArrayList, rather than List? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markap14 commented on a diff in the pull request: https://github.com/apache/nifi/pull/280#discussion_r56233876 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -63,9 +71,16 @@ @SupportsBatching @Tags({"split", "text"}) @InputRequirement(Requirement.INPUT_REQUIRED) -@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines") +//@CapabilityDescription("Splits a text file into multiple smaller text files on line boundaries, each having up to a configured number of lines") --- End diff -- The commented-out line should probably be removed :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
GitHub user jskora opened a pull request: https://github.com/apache/nifi/pull/280 NIFI-1118 Update SplitText Processor - add support for split size limits and header line markers. * Add "Maximum Fragment Size" property. A new split file will be created if the next line to be added to the current split file exceeds this user-defined maximum file size. In the case where an input line is longer than the fragment size, this line will be output in a separate split file that will exceed the maximum fragment size. * Add "Header Line Marker Character" property. Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines. The existing property "Header Line Count" must be zero for this new property and behavior to be used. * Deprecated the "Remove Trailing Newlines" property. * Fixed conditional that incorrectly suppressed splits where the content line count equaled the header line count and did not remove empty splits from the session. * Minor formatting cleanup. * Exclude test files from RAT check in pom.xml. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jskora/nifi NIFI-1118 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nifi/pull/280.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #280 commit 914c5a2c52b19d153077a89b59f751aa49ddf86c Author: Joe SkoraDate: 2016-03-15T14:31:00Z NIFI-1118 Update SplitText Processor - add support for split size limits and header line markers. * Add "Maximum Fragment Size" property. A new split file will be created if the next line to be added to the current split file exceeds this user-defined maximum file size. In the case where an input line is longer than the fragment size, this line will be output in a separate split file that will exceed the maximum fragment size. * Add "Header Line Marker Character" property. Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines. The existing property "Header Line Count" must be zero for this new property and behavior to be used. * Fixed conditional that incorrectly suppressed splits where the content line count equaled the header line count and did not remove empty splits from the session. * Minor formatting cleanup. * Exclude test files from RAT check in pom.xml. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user jskora commented on the pull request: https://github.com/apache/nifi/pull/135#issuecomment-196847105 Cancelling this pull request, will followup with new PR for updated and current source branch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user jskora closed the pull request at: https://github.com/apache/nifi/pull/135 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markobean commented on a diff in the pull request: https://github.com/apache/nifi/pull/135#discussion_r50990975 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -143,26 +165,12 @@ protected void init(final ProcessorInitializationContext context) { return properties; } -private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException { -int numLines = 0; -for (int i = 0; i < maxNumLines; i++) { -final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1)); -if (bytes <= 0) { -return numLines; -} - -numLines++; -} - -return numLines; -} - -private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final boolean includeLineDelimiter) throws IOException { +private int readLine(final InputStream in, final OutputStream out, + final boolean includeLineDelimiter) throws IOException { int lastByte = -1; -long bytesRead = 0L; +int bytesRead = 0; while (true) { -in.mark(1); --- End diff -- This "in.mark(1)" was removed because marking and resetting the input stream has changed. Previously, this mark/reset was used to rollback the reading of the character after a \r. (More below.) Now, mark/reset is used at a higher level to potentially rollback an entire line read from the input flowfile if that line exceeds the size limit imposed by the FRAGMENT_MAX_SIZE property. Removing this in.mark() does not generate an IOException in the below (line 206) in.reset() because an in.mark() was previously made prior to the call to readLine() (line 239 or line 323.) Nonetheless, the overall logic is still incorrect. There are two logical in.reset() conditions: character-based and line-based. This must be corrected and made consistent (line-based). The special consideration of the \r character confuses me ('if' block beginning on Line 205.) In the original code, the byte after the \r is rolled back. Why? If the byte after \r is \n, the special consideration for \r is not required as the 'if' on line 194 captures the end of line. Is it valid and intended to have the \r indicate an end of line even when a subsequent \n is not present? (Note: testing of present SplitText processor shows incorrect behavior for only \r without \n; it duplicates the first character of the next "line".) If special consideration for \r is not required and the 'if' block beginning on line 205 is removed, both the in.mark and in.reset within the readLine() method go away, and I believe all will be correct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markobean commented on a diff in the pull request: https://github.com/apache/nifi/pull/135#discussion_r50989494 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -230,100 +270,112 @@ public void onTrigger(final ProcessContext context, final ProcessSession session final ProcessorLog logger = getLogger(); final int headerCount = context.getProperty(HEADER_LINE_COUNT).asInteger(); final int splitCount = context.getProperty(LINE_SPLIT_COUNT).asInteger(); +final double maxFragmentSize; +if (context.getProperty(FRAGMENT_MAX_SIZE).isSet()) { +maxFragmentSize = context.getProperty(FRAGMENT_MAX_SIZE).asDataSize(DataUnit.B); +} else { +maxFragmentSize = Integer.MAX_VALUE; +} +final String headerMarker = context.getProperty(HEADER_MARKER).getValue(); final boolean removeTrailingNewlines = context.getProperty(REMOVE_TRAILING_NEWLINES).asBoolean(); - final ObjectHolder errorMessage = new ObjectHolder<>(null); -final ArrayList splitInfos = new ArrayList<>(); - final long startNanos = System.nanoTime(); final List splits = new ArrayList<>(); + session.read(flowFile, new InputStreamCallback() { @Override public void process(final InputStream rawIn) throws IOException { try (final BufferedInputStream bufferedIn = new BufferedInputStream(rawIn); final ByteCountingInputStream in = new ByteCountingInputStream(bufferedIn)) { -// if we have header lines, copy them into a ByteArrayOutputStream +// Identify header, if any final ByteArrayOutputStream headerStream = new ByteArrayOutputStream(); -final int headerLinesCopied = readLines(in, headerCount, headerStream, true); -if (headerLinesCopied < headerCount) { -errorMessage.set("Header Line Count is set to " + headerCount + " but file had only " + headerLinesCopied + " lines"); +final SplitInfo headerInfo = readHeader(headerCount, headerMarker, in, headerStream, true); +if (headerInfo.lengthLines < headerCount) { +errorMessage.set("Header Line Count is set to " + headerCount + " but file had only " ++ headerInfo.lengthLines + " lines"); return; } while (true) { -if (headerCount > 0) { -// if we have header lines, create a new FlowFile, copy the header lines to that file, -// and then start copying lines -final IntegerHolder linesCopied = new IntegerHolder(0); -FlowFile splitFile = session.create(flowFile); -try { -splitFile = session.write(splitFile, new OutputStreamCallback() { -@Override -public void process(final OutputStream rawOut) throws IOException { -try (final BufferedOutputStream out = new BufferedOutputStream(rawOut)) { +FlowFile splitFile = session.create(flowFile); +final SplitInfo flowFileInfo = new SplitInfo(); + +// if we have header lines, write them out +// and then start copying lines +try { +splitFile = session.write(splitFile, new OutputStreamCallback() { +@Override +public void process(final OutputStream rawOut) throws IOException { +try (final BufferedOutputStream out = new BufferedOutputStream(rawOut)) { +long lineCount = 0; +long byteCount = 0; +// Process header +if (headerInfo.lengthLines > 0) { +flowFileInfo.lengthBytes = headerInfo.lengthBytes; +byteCount = headerInfo.lengthBytes; headerStream.writeTo(out); -linesCopied.set(readLines(in, splitCount, out,
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markobean commented on a diff in the pull request: https://github.com/apache/nifi/pull/135#discussion_r50449323 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -198,23 +208,53 @@ private long countBytesToSplitPoint(final InputStream in, final OutputStream out return includeLineDelimiter ? bytesRead : bytesRead - 1; } -// keep track of what the last byte was that we read so that we can detect \r followed by some other +// keep track of what the last byte was that we read so that we can +// detect \r followed by some other // character. lastByte = nextByte; } } -private SplitInfo countBytesToSplitPoint(final InputStream in, final int numLines, final boolean keepAllNewLines) throws IOException { +private SplitInfo readHeader(final int numHeaderLines, + final String headerMarker, final InputStream in, + final OutputStream out, final boolean keepAllNewLines) +throws IOException { SplitInfo info = new SplitInfo(); - -while (info.lengthLines < numLines) { -final long bytesTillNext = countBytesToSplitPoint(in, null, keepAllNewLines || (info.lengthLines != numLines - 1)); -if (bytesTillNext <= 0L) { -break; +boolean isHeaderLine = true; + +// Read numHeaderLines from file, if specificed; a non-zero value takes precedence +// over headerMarker character string +if (numHeaderLines > 0) { +for (int i = 0; i < numHeaderLines; i++) { +int bytesRead = readLine(in, out, keepAllNewLines); +if (bytesRead == 0) { +break; +} +info.lengthBytes += bytesRead; +info.lengthLines++; +} +// Else, keep reading all lines that begin with headerMarker character string +} else if (headerMarker != null) { +while (true) { +in.mark(0); --- End diff -- This section of code is reading header lines a character at a time to determine if the line begins with the headerMarker character(s). After the line has been determined to be a header (or non-header), the buffer is reset so that a subsequent call to readLine() will capture all characters of the line. The maximum number of characters readHeader() will read before calling readLine() is the number of characters in the headerMarker String. Therefore, the in.mark(0) should be changed to in.mark(headerMarker.length). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user markap14 commented on a diff in the pull request: https://github.com/apache/nifi/pull/135#discussion_r50328396 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -198,23 +208,53 @@ private long countBytesToSplitPoint(final InputStream in, final OutputStream out return includeLineDelimiter ? bytesRead : bytesRead - 1; } -// keep track of what the last byte was that we read so that we can detect \r followed by some other +// keep track of what the last byte was that we read so that we can +// detect \r followed by some other // character. lastByte = nextByte; } } -private SplitInfo countBytesToSplitPoint(final InputStream in, final int numLines, final boolean keepAllNewLines) throws IOException { +private SplitInfo readHeader(final int numHeaderLines, + final String headerMarker, final InputStream in, + final OutputStream out, final boolean keepAllNewLines) +throws IOException { SplitInfo info = new SplitInfo(); - -while (info.lengthLines < numLines) { -final long bytesTillNext = countBytesToSplitPoint(in, null, keepAllNewLines || (info.lengthLines != numLines - 1)); -if (bytesTillNext <= 0L) { -break; +boolean isHeaderLine = true; + +// Read numHeaderLines from file, if specificed; a non-zero value takes precedence +// over headerMarker character string +if (numHeaderLines > 0) { +for (int i = 0; i < numHeaderLines; i++) { +int bytesRead = readLine(in, out, keepAllNewLines); +if (bytesRead == 0) { +break; +} +info.lengthBytes += bytesRead; +info.lengthLines++; +} +// Else, keep reading all lines that begin with headerMarker character string +} else if (headerMarker != null) { +while (true) { +in.mark(0); --- End diff -- This call to in.mark(0) means that when in.reset() is called, we will reset back 0 bytes, essentially making the mark/reset do nothing. Not sure that I am following the logic here with how the mark/reset are being used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user jskora commented on the pull request: https://github.com/apache/nifi/pull/135#issuecomment-168806914 I just pushed changes that address these issues, including tests for these and a couple more edge conditions. It has been squashed (thanks @trkurc) into a single commit for readability. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user joewitt commented on a diff in the pull request: https://github.com/apache/nifi/pull/135#discussion_r48701049 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -75,6 +76,7 @@ // attribute keys public static final String SPLIT_LINE_COUNT = "text.line.count"; +public static final String SPLIT_SIZE = "fragment.size"; --- End diff -- recommend calling this "FRAGMENT_SIZE" to be consistent with the value and other fragment names --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
Github user joewitt commented on a diff in the pull request: https://github.com/apache/nifi/pull/135#discussion_r48701229 --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java --- @@ -82,10 +84,16 @@ public static final PropertyDescriptor LINE_SPLIT_COUNT = new PropertyDescriptor.Builder() .name("Line Split Count") -.description("The number of lines that will be added to each split file") +.description("The number of lines that will be added to each split file, excluding header lines") .required(true) .addValidator(StandardValidators.POSITIVE_INTEGER_VALIDATOR) .build(); +public static final PropertyDescriptor FRAGMENT_MAX_SIZE = new PropertyDescriptor.Builder() +.name("Maximum Fragment Size") +.description("The maximum size of each split file, including header lines") --- End diff -- Seems like there could be some interesting edge conditions. For instance what if a given line is by itself larger than the max? Or, does the max get exceeded by the size of at most one line then split or is the max exceeded then a pushback mechanism goes to the previous line and does the split and then continues? I might be overthinking this but in any case would be good to let the user know how these edge cases are handled. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...
GitHub user jskora opened a pull request: https://github.com/apache/nifi/pull/135 NIFI-1118 Update SplitText Processor - add support for split size lim⦠â¦its and header line markers. 1) Add "Maximum Fragment Size" property. A new split file will be created if the next line to be added to the current split file exceeds this user-defined maximum file size. 2) Add "Header Line Marker Character" property. Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines. The existing property "Header Line Count" must be zero for this new property and behavior to be used. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jskora/nifi NIFI-1118 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nifi/pull/135.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #135 commit ddd394824e3f36429e5b43b755dd8d9df25f05ab Author: Joe SkoraDate: 2015-12-01T18:32:36Z NIFI-1118 Update SplitText Processor - add support for split size limits and header line markers. 1) Add "Maximum Fragment Size" property. A new split file will be created if the next line to be added to the current split file exceeds this user-defined maximum file size. 2) Add "Header Line Marker Character" property. Lines that begin with these user-defined character(s) will be considered header line(s) rather than a predetermined number of lines. The existing property "Header Line Count" must be zero for this new property and behavior to be used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---