[ https://issues.apache.org/jira/browse/NIFI-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197275#comment-15197275 ]
ASF GitHub Bot commented on NIFI-1118:
--------------------------------------
Github user markap14 commented on a diff in the pull request:
https://github.com/apache/nifi/pull/280#discussion_r56329501
--- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java ---
@@ -143,72 +199,82 @@ protected void init(final ProcessorInitializationContext context) {
         return properties;
     }
-    private int readLines(final InputStream in, final int maxNumLines, final OutputStream out, final boolean keepAllNewLines) throws IOException {
+    private int readLines(final InputStream in, final int maxNumLines, final long maxByteCount, final OutputStream out) throws IOException {
         int numLines = 0;
+        long totalBytes = 0L;
         for (int i = 0; i < maxNumLines; i++) {
-            final long bytes = countBytesToSplitPoint(in, out, keepAllNewLines || (i != maxNumLines - 1));
+            final long bytes = countBytesToSplitPoint(in, out, totalBytes, maxByteCount);
+            totalBytes += bytes;
             if (bytes <= 0) {
                 return numLines;
             }
-
             numLines++;
+            if (totalBytes >= maxByteCount && numLines > maxNumLines) {
+                break;
+            }
         }
-
         return numLines;
     }
-    private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final boolean includeLineDelimiter) throws IOException {
-        int lastByte = -1;
+    private long countBytesToSplitPoint(final InputStream in, final OutputStream out, final long bytesReadSoFar, final long maxSize) throws IOException {
         long bytesRead = 0L;
+        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
--- End diff --
It looks like we always buffer this data on the Java heap, but we never use
the buffer when out == null (which I believe is the more common case). That is
pretty concerning: we should avoid buffering potentially large amounts of data
on the Java heap unless absolutely necessary. I think this needs to be reworked
so that nothing is buffered here when out == null, since we won't make use of
it anyway.
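To illustrate the point, here is a minimal sketch of a count-only path that never touches the heap when out == null. It is not the code from the pull request: the class name SplitPointSketch is hypothetical, the method takes only the two stream arguments, and the maxSize check and '\r' handling are deliberately left out. It also assumes the caller wraps the input in a BufferedInputStream, since it reads one byte at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the SplitText implementation.
final class SplitPointSketch {

    // Counts the bytes up to and including the next '\n', or up to EOF.
    // Each byte is passed straight to 'out' when one was supplied; when
    // 'out' is null the byte is only counted, so nothing accumulates in memory.
    static long countBytesToSplitPoint(final InputStream in, final OutputStream out) throws IOException {
        long bytesRead = 0L;
        int nextByte;
        while ((nextByte = in.read()) != -1) {
            bytesRead++;
            if (out != null) {
                out.write(nextByte);
            }
            if (nextByte == '\n') {
                break;
            }
        }
        return bytesRead;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] data = "first\nsecond\n".getBytes(StandardCharsets.UTF_8);

        // Count-only pass: no output stream, no buffer.
        final long counted = countBytesToSplitPoint(new ByteArrayInputStream(data), null);

        // Copy pass: the same line is written to the supplied stream.
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        final long copied = countBytesToSplitPoint(new ByteArrayInputStream(data), sink);

        System.out.println(counted + " " + copied + " " + sink.size()); // prints "6 6 6"
    }
}
```

If the buffer in the diff exists so that a line can be held back when it would exceed the size limit, that need presumably only arises when out is non-null, so the allocation could be made conditional on out rather than unconditional.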
> Enable SplitText processor to limit line length and filter header lines
> -----------------------------------------------------------------------
>
> Key: NIFI-1118
> URL: https://issues.apache.org/jira/browse/NIFI-1118
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Bean
> Assignee: Joe Skora
> Fix For: 0.6.0
>
>
> Add the following functionality to the SplitText processor:
> 1) Maximum size limit of the split file(s)
> A new split file will be created if adding the next line to the current
> split file would exceed a user-defined maximum file size.
> 2) Header line marker
> User-defined character(s) can be used to identify the header line(s) of the
> data file, rather than a predetermined number of lines.
> These changes are additions, not replacements of any property or behavior.
> In the case of the header line marker, the existing property "Header Line Count"
> must be zero for the new property and behavior to be used.
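
For reference, the two rules in the issue description can be sketched as below. This is an illustration under stated assumptions, not the processor's code: the class and method names are hypothetical, lines are held in memory as strings (the real processor streams), header lines are assumed to be the leading run of lines that start with the marker, and a single line larger than the limit is assumed to get a split of its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two rules; not the SplitText implementation.
final class SplitRulesSketch {

    // Assumption: header lines are the leading lines that start with the marker.
    // An empty marker disables detection, since every line starts with "".
    static int countHeaderLines(final List<String> lines, final String marker) {
        if (marker.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (final String line : lines) {
            if (!line.startsWith(marker)) {
                break;
            }
            count++;
        }
        return count;
    }

    // Groups lines into splits. A new split starts when adding the next line
    // would push the current split past maxBytes. A split is never left empty,
    // so an oversized line still ends up in a split by itself.
    static List<List<String>> splitBySize(final List<String> lines, final long maxBytes) {
        final List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0L;
        for (final String line : lines) {
            final long lineBytes = line.getBytes(StandardCharsets.UTF_8).length;
            if (!current.isEmpty() && currentBytes + lineBytes > maxBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0L;
            }
            current.add(line);
            currentBytes += lineBytes;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With a 10-byte limit, the lines "aaaa\n", "bbbb\n" and "cc\n" would fall into two splits: the first two lines total exactly 10 bytes and stay together, and the third line starts a new split.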
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)