[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-04-19 Thread joewitt
Github user joewitt commented on the pull request:

https://github.com/apache/nifi/pull/280#issuecomment-211956655
  
thanks @markobean and @jskora .  What do you think about making ignore 
newlines only be honored/supported when not using the new features you're 
planning to include or only in very specific configurations?  I ask because 
this, admittedly mistaken, feature is used a lot.  Ultimately if that seems to 
unwieldy we can punt that feature in 1.0, add your new capabilities, and 
support end of line removal on ReplaceText instead.  We just need to remember 
to document this in the migration guide for 1.0 as this could cause some pretty 
funky behavior changes for folks.

What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-04-19 Thread markobean
Github user markobean commented on the pull request:

https://github.com/apache/nifi/pull/280#issuecomment-211915220
  
@joewitt Talked with @jskora and concur this PR can be closed. A new PR 
will be opened later with the new features and also with the Return Trailing 
Newlines bug fix included (if RTN is still included.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-04-18 Thread joewitt
Github user joewitt commented on the pull request:

https://github.com/apache/nifi/pull/280#issuecomment-211726456
  
@jskora do you think this PR can be closed now given the updates made to 
fix the underlying defects found?  A new PR could be submitted which adds the 
proposed features or goes into ReplaceText or a new processor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-20 Thread markap14
Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/280#discussion_r56329501
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -143,72 +199,82 @@ protected void init(final 
ProcessorInitializationContext context) {
 return properties;
 }
 
-private int readLines(final InputStream in, final int maxNumLines, 
final OutputStream out, final boolean keepAllNewLines) throws IOException {
+private int readLines(final InputStream in, final int maxNumLines, 
final long maxByteCount, final OutputStream out) throws IOException {
 int numLines = 0;
+long totalBytes = 0L;
 for (int i = 0; i < maxNumLines; i++) {
-final long bytes = countBytesToSplitPoint(in, out, 
keepAllNewLines || (i != maxNumLines - 1));
+final long bytes = countBytesToSplitPoint(in, out, totalBytes, 
maxByteCount);
+totalBytes += bytes;
 if (bytes <= 0) {
 return numLines;
 }
-
 numLines++;
+if (totalBytes >= maxByteCount && numLines > maxNumLines) {
+break;
+}
 }
-
 return numLines;
 }
 
-private long countBytesToSplitPoint(final InputStream in, final 
OutputStream out, final boolean includeLineDelimiter) throws IOException {
-int lastByte = -1;
+private long countBytesToSplitPoint(final InputStream in, final 
OutputStream out, final long bytesReadSoFar, final long maxSize) throws 
IOException {
 long bytesRead = 0L;
+final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
--- End diff --

It looks like we always buffer this data into Java heap, but we don't 
actually use it for anything if out == null (which is the case more often than 
not, I believe). This is pretty concerning, as we should avoid buffering 
potentially large amounts of data into Java heap unless absolutely necessary. I 
think this needs to be reworked so that we don't buffer the data here, if out 
== null, since we won't make use of it anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-20 Thread jskora
Github user jskora commented on a diff in the pull request:

https://github.com/apache/nifi/pull/280#discussion_r56322564
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -134,6 +168,28 @@ protected void init(final 
ProcessorInitializationContext context) {
 }
 
 @Override
+protected Collection 
customValidate(ValidationContext validationContext) {
+ArrayList results = new ArrayList<>();
--- End diff --

I'm fixing these.  Update will be pushed shortly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-19 Thread markap14
Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/280#discussion_r56327650
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -143,72 +199,82 @@ protected void init(final 
ProcessorInitializationContext context) {
 return properties;
 }
 
-private int readLines(final InputStream in, final int maxNumLines, 
final OutputStream out, final boolean keepAllNewLines) throws IOException {
+private int readLines(final InputStream in, final int maxNumLines, 
final long maxByteCount, final OutputStream out) throws IOException {
 int numLines = 0;
+long totalBytes = 0L;
 for (int i = 0; i < maxNumLines; i++) {
-final long bytes = countBytesToSplitPoint(in, out, 
keepAllNewLines || (i != maxNumLines - 1));
+final long bytes = countBytesToSplitPoint(in, out, totalBytes, 
maxByteCount);
+totalBytes += bytes;
 if (bytes <= 0) {
 return numLines;
 }
-
 numLines++;
+if (totalBytes >= maxByteCount && numLines > maxNumLines) {
--- End diff --

The logic here appears to be incorrect.
numLines is incremented for each iteration of the loop (unless we return 
before it is incremented).
This means that numLines <= i
The loop's condition indicates i < maxNumLines
So numLines <= i < maxNumLines
So it is always the case that numLines < maxNumLines, so this condition 
will never be satisfied because numLines will never be > maxNumLines

Now, looking through the code and doing a bit of testing, this does not 
appear to return an incorrect result, since countBytesToSplitPoint will handle 
the logic appropriately itself, but this should be fixed before it is merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-19 Thread markobean
Github user markobean commented on a diff in the pull request:

https://github.com/apache/nifi/pull/280#discussion_r56328470
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -143,72 +199,82 @@ protected void init(final 
ProcessorInitializationContext context) {
 return properties;
 }
 
-private int readLines(final InputStream in, final int maxNumLines, 
final OutputStream out, final boolean keepAllNewLines) throws IOException {
+private int readLines(final InputStream in, final int maxNumLines, 
final long maxByteCount, final OutputStream out) throws IOException {
 int numLines = 0;
+long totalBytes = 0L;
 for (int i = 0; i < maxNumLines; i++) {
-final long bytes = countBytesToSplitPoint(in, out, 
keepAllNewLines || (i != maxNumLines - 1));
+final long bytes = countBytesToSplitPoint(in, out, totalBytes, 
maxByteCount);
+totalBytes += bytes;
 if (bytes <= 0) {
 return numLines;
 }
-
 numLines++;
+if (totalBytes >= maxByteCount && numLines > maxNumLines) {
--- End diff --

Agreed. "&& numLines > maxNumLines" condition can be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-19 Thread jskora
Github user jskora commented on the pull request:

https://github.com/apache/nifi/pull/280#issuecomment-197388774
  
I think the issues are all covered now and tests have been expanded to get 
all critical logic and edge cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-15 Thread markap14
Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/280#discussion_r56234292
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -134,6 +168,28 @@ protected void init(final 
ProcessorInitializationContext context) {
 }
 
 @Override
+protected Collection 
customValidate(ValidationContext validationContext) {
+ArrayList results = new ArrayList<>();
+
+results.add(new ValidationResult.Builder()
+.subject("Remove Trailing Newlines")
+
.valid(!validationContext.getProperty(REMOVE_TRAILING_NEWLINES).asBoolean())
+.explanation("Property is deprecated; value must be set to 
false. Future releases may remove this Property.")
+.build());
+
+final boolean invalidState = 
(validationContext.getProperty(LINE_SPLIT_COUNT).asInteger() == 0
+&& 
validationContext.getProperty(FRAGMENT_MAX_SIZE).asDataSize(DataUnit.B) == 
null);
--- End diff --

I think this line should be:
&& !validationContext.getProperty(FRAGMENT_MAX_SIZE).isSet() );


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-15 Thread markap14
Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/280#discussion_r56234376
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -134,6 +168,28 @@ protected void init(final 
ProcessorInitializationContext context) {
 }
 
 @Override
+protected Collection 
customValidate(ValidationContext validationContext) {
+ArrayList results = new ArrayList<>();
--- End diff --

Is there any reason we are declaring the type here as ArrayList, rather 
than List?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-15 Thread markap14
Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/280#discussion_r56233876
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -63,9 +71,16 @@
 @SupportsBatching
 @Tags({"split", "text"})
 @InputRequirement(Requirement.INPUT_REQUIRED)
-@CapabilityDescription("Splits a text file into multiple smaller text 
files on line boundaries, each having up to a configured number of lines")
+//@CapabilityDescription("Splits a text file into multiple smaller text 
files on line boundaries, each having up to a configured number of lines")
--- End diff --

The commented-out line should probably be removed :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-15 Thread jskora
GitHub user jskora opened a pull request:

https://github.com/apache/nifi/pull/280

NIFI-1118 Update SplitText Processor - add support for split size limits 
and header line markers.

* Add "Maximum Fragment Size" property.  A new split file will be created 
if the next line to be added to the current split file exceeds this 
user-defined maximum file size.  In the case where an input line is longer than 
the fragment size, this line will be output in a separate split file that will 
exceed the maximum fragment size.
* Add "Header Line Marker Character" property.  Lines that begin with these 
user-defined character(s) will be considered header line(s) rather than a 
predetermined number of lines.  The existing property "Header Line Count" must 
be zero for this new property and behavior to be used.
* Deprecated the "Remove Trailing Newlines" property.
* Fixed conditional that incorrectly suppressed splits where the content 
line count equaled the header line count and did not remove empty splits from 
the session.
* Minor formatting cleanup.
* Exclude test files from RAT check in pom.xml.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jskora/nifi NIFI-1118

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #280


commit 914c5a2c52b19d153077a89b59f751aa49ddf86c
Author: Joe Skora 
Date:   2016-03-15T14:31:00Z

NIFI-1118 Update SplitText Processor - add support for split size limits 
and header line markers.
* Add "Maximum Fragment Size" property.  A new split file will be created 
if the next line to be added to the current split file exceeds this 
user-defined maximum file size.  In the case where an input line is longer than 
the fragment size, this line will be output in a separate split file that will 
exceed the maximum fragment size.
* Add "Header Line Marker Character" property.  Lines that begin with these 
user-defined character(s) will be considered header line(s) rather than a 
predetermined number of lines.  The existing property "Header Line Count" must 
be zero for this new property and behavior to be used.
* Fixed conditional that incorrectly suppressed splits where the content 
line count equaled the header line count and did not remove empty splits from 
the session.
* Minor formatting cleanup.
* Exclude test files from RAT check in pom.xml.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-15 Thread jskora
Github user jskora commented on the pull request:

https://github.com/apache/nifi/pull/135#issuecomment-196847105
  
Cancelling this pull request, will followup with new PR for updated and 
current source branch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-03-15 Thread jskora
Github user jskora closed the pull request at:

https://github.com/apache/nifi/pull/135


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-01-27 Thread markobean
Github user markobean commented on a diff in the pull request:

https://github.com/apache/nifi/pull/135#discussion_r50990975
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -143,26 +165,12 @@ protected void init(final 
ProcessorInitializationContext context) {
 return properties;
 }
 
-private int readLines(final InputStream in, final int maxNumLines, 
final OutputStream out, final boolean keepAllNewLines) throws IOException {
-int numLines = 0;
-for (int i = 0; i < maxNumLines; i++) {
-final long bytes = countBytesToSplitPoint(in, out, 
keepAllNewLines || (i != maxNumLines - 1));
-if (bytes <= 0) {
-return numLines;
-}
-
-numLines++;
-}
-
-return numLines;
-}
-
-private long countBytesToSplitPoint(final InputStream in, final 
OutputStream out, final boolean includeLineDelimiter) throws IOException {
+private int readLine(final InputStream in, final OutputStream out,
+  final boolean includeLineDelimiter) throws 
IOException {
 int lastByte = -1;
-long bytesRead = 0L;
+int bytesRead = 0;
 
 while (true) {
-in.mark(1);
--- End diff --

This "in.mark(1)" was removed because marking and resetting the input 
stream has changed. Previously, this mark/reset was used to rollback the 
reading of the character after a \r. (More below.) Now, mark/reset is used at a 
higher level to potentially rollback an entire line read from the input 
flowfile if that line exceeds the size limit imposed by the FRAGMENT_MAX_SIZE 
property.
Removing this in.mark() does not generate an IOException in the below (line 
206) in.reset() because an in.mark() was previously made prior to the call to 
readLine() (line 239 or line 323.) Nonetheless, the overall logic is still 
incorrect. There are two logical in.reset() conditions: character-based and 
line-based. This must be corrected and made consistent (line-based).
The special consideration of the \r character confuses me ('if' block 
beginning on Line 205.) In the original code, the byte after the \r is rolled 
back. Why? If the byte after \r is \n, the special consideration for \r is not 
required as the 'if' on line 194 captures the end of line. Is it valid and 
intended to have the \r indicate an end of line even when a subsequent \n is 
not present? (Note: testing of present SplitText processor shows incorrect 
behavior for only \r without \n; it duplicates the first character of the next 
"line".)
If special consideration for \r is not required and the 'if' block 
beginning on line 205 is removed, both the in.mark and in.reset within the 
readLine() method go away, and I believe all will be correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-01-27 Thread markobean
Github user markobean commented on a diff in the pull request:

https://github.com/apache/nifi/pull/135#discussion_r50989494
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -230,100 +270,112 @@ public void onTrigger(final ProcessContext context, 
final ProcessSession session
 final ProcessorLog logger = getLogger();
 final int headerCount = 
context.getProperty(HEADER_LINE_COUNT).asInteger();
 final int splitCount = 
context.getProperty(LINE_SPLIT_COUNT).asInteger();
+final double maxFragmentSize;
+if (context.getProperty(FRAGMENT_MAX_SIZE).isSet()) {
+maxFragmentSize = 
context.getProperty(FRAGMENT_MAX_SIZE).asDataSize(DataUnit.B);
+} else {
+maxFragmentSize = Integer.MAX_VALUE;
+}
+final String headerMarker = 
context.getProperty(HEADER_MARKER).getValue();
 final boolean removeTrailingNewlines = 
context.getProperty(REMOVE_TRAILING_NEWLINES).asBoolean();
-
 final ObjectHolder errorMessage = new ObjectHolder<>(null);
-final ArrayList splitInfos = new ArrayList<>();
-
 final long startNanos = System.nanoTime();
 final List splits = new ArrayList<>();
+
 session.read(flowFile, new InputStreamCallback() {
 @Override
 public void process(final InputStream rawIn) throws 
IOException {
 try (final BufferedInputStream bufferedIn = new 
BufferedInputStream(rawIn);
 final ByteCountingInputStream in = new 
ByteCountingInputStream(bufferedIn)) {
 
-// if we have header lines, copy them into a 
ByteArrayOutputStream
+// Identify header, if any
 final ByteArrayOutputStream headerStream = new 
ByteArrayOutputStream();
-final int headerLinesCopied = readLines(in, 
headerCount, headerStream, true);
-if (headerLinesCopied < headerCount) {
-errorMessage.set("Header Line Count is set to " + 
headerCount + " but file had only " + headerLinesCopied + " lines");
+final SplitInfo headerInfo = readHeader(headerCount, 
headerMarker, in, headerStream, true);
+if (headerInfo.lengthLines < headerCount) {
+errorMessage.set("Header Line Count is set to " + 
headerCount + " but file had only "
++ headerInfo.lengthLines + " lines");
 return;
 }
 
 while (true) {
-if (headerCount > 0) {
-// if we have header lines, create a new 
FlowFile, copy the header lines to that file,
-// and then start copying lines
-final IntegerHolder linesCopied = new 
IntegerHolder(0);
-FlowFile splitFile = session.create(flowFile);
-try {
-splitFile = session.write(splitFile, new 
OutputStreamCallback() {
-@Override
-public void process(final OutputStream 
rawOut) throws IOException {
-try (final BufferedOutputStream 
out = new BufferedOutputStream(rawOut)) {
+FlowFile splitFile = session.create(flowFile);
+final SplitInfo flowFileInfo = new SplitInfo();
+
+// if we have header lines, write them out
+// and then start copying lines
+try {
+splitFile = session.write(splitFile, new 
OutputStreamCallback() {
+@Override
+public void process(final OutputStream 
rawOut) throws IOException {
+try (final BufferedOutputStream out = 
new BufferedOutputStream(rawOut)) {
+long lineCount = 0;
+long byteCount = 0;
+// Process header
+if (headerInfo.lengthLines > 0) {
+flowFileInfo.lengthBytes = 
headerInfo.lengthBytes;
+byteCount = 
headerInfo.lengthBytes;
 headerStream.writeTo(out);
-linesCopied.set(readLines(in, 
splitCount, out, 

[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-01-21 Thread markobean
Github user markobean commented on a diff in the pull request:

https://github.com/apache/nifi/pull/135#discussion_r50449323
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -198,23 +208,53 @@ private long countBytesToSplitPoint(final InputStream 
in, final OutputStream out
 return includeLineDelimiter ? bytesRead : bytesRead - 1;
 }
 
-// keep track of what the last byte was that we read so that 
we can detect \r followed by some other
+// keep track of what the last byte was that we read so that 
we can
+// detect \r followed by some other
 // character.
 lastByte = nextByte;
 }
 }
 
-private SplitInfo countBytesToSplitPoint(final InputStream in, final 
int numLines, final boolean keepAllNewLines) throws IOException {
+private SplitInfo readHeader(final int numHeaderLines,
+ final String headerMarker, final 
InputStream in,
+ final OutputStream out, final boolean 
keepAllNewLines)
+throws IOException {
 SplitInfo info = new SplitInfo();
-
-while (info.lengthLines < numLines) {
-final long bytesTillNext = countBytesToSplitPoint(in, null, 
keepAllNewLines || (info.lengthLines != numLines - 1));
-if (bytesTillNext <= 0L) {
-break;
+boolean isHeaderLine = true;
+
+// Read numHeaderLines from file, if specificed; a non-zero value 
takes precedence
+// over headerMarker character string
+if (numHeaderLines > 0) {
+for (int i = 0; i < numHeaderLines; i++) {
+int bytesRead = readLine(in, out, keepAllNewLines);
+if (bytesRead == 0) {
+break;
+}
+info.lengthBytes += bytesRead;
+info.lengthLines++;
+}
+// Else, keep reading all lines that begin with headerMarker 
character string
+} else if (headerMarker != null) {
+while (true) {
+in.mark(0);
--- End diff --

This section of code is reading header lines a character at a time to 
determine if the line begins with the headerMarker character(s). After the line 
has been determined to be a header (or non-header), the buffer is reset so that 
a subsequent call to readLine() will capture all characters of the line. The 
maximum number of characters readHeader() will read before calling readLine() 
is the number of characters in the headerMarker String. Therefore, the 
in.mark(0) should be changed to in.mark(headerMarker.length).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-01-20 Thread markap14
Github user markap14 commented on a diff in the pull request:

https://github.com/apache/nifi/pull/135#discussion_r50328396
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -198,23 +208,53 @@ private long countBytesToSplitPoint(final InputStream 
in, final OutputStream out
 return includeLineDelimiter ? bytesRead : bytesRead - 1;
 }
 
-// keep track of what the last byte was that we read so that 
we can detect \r followed by some other
+// keep track of what the last byte was that we read so that 
we can
+// detect \r followed by some other
 // character.
 lastByte = nextByte;
 }
 }
 
-private SplitInfo countBytesToSplitPoint(final InputStream in, final 
int numLines, final boolean keepAllNewLines) throws IOException {
+private SplitInfo readHeader(final int numHeaderLines,
+ final String headerMarker, final 
InputStream in,
+ final OutputStream out, final boolean 
keepAllNewLines)
+throws IOException {
 SplitInfo info = new SplitInfo();
-
-while (info.lengthLines < numLines) {
-final long bytesTillNext = countBytesToSplitPoint(in, null, 
keepAllNewLines || (info.lengthLines != numLines - 1));
-if (bytesTillNext <= 0L) {
-break;
+boolean isHeaderLine = true;
+
+// Read numHeaderLines from file, if specificed; a non-zero value 
takes precedence
+// over headerMarker character string
+if (numHeaderLines > 0) {
+for (int i = 0; i < numHeaderLines; i++) {
+int bytesRead = readLine(in, out, keepAllNewLines);
+if (bytesRead == 0) {
+break;
+}
+info.lengthBytes += bytesRead;
+info.lengthLines++;
+}
+// Else, keep reading all lines that begin with headerMarker 
character string
+} else if (headerMarker != null) {
+while (true) {
+in.mark(0);
--- End diff --

This call to in.mark(0) means that when in.reset() is called, we will reset 
back 0 bytes, essentially making the mark/reset do nothing. Not sure that I am 
following the logic here with how the mark/reset are being used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-01-04 Thread jskora
Github user jskora commented on the pull request:

https://github.com/apache/nifi/pull/135#issuecomment-168806914
  
I just pushed changes that address these issues, including tests for these 
and a couple more edge conditions.  It has been squashed (thanks @trkurc) into 
a single commit for readability.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-01-03 Thread joewitt
Github user joewitt commented on a diff in the pull request:

https://github.com/apache/nifi/pull/135#discussion_r48701049
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -75,6 +76,7 @@
 
 // attribute keys
 public static final String SPLIT_LINE_COUNT = "text.line.count";
+public static final String SPLIT_SIZE = "fragment.size";
--- End diff --

recommend calling this "FRAGMENT_SIZE" to be consistent with the value and 
other fragment names


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2016-01-03 Thread joewitt
Github user joewitt commented on a diff in the pull request:

https://github.com/apache/nifi/pull/135#discussion_r48701229
  
--- Diff: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitText.java
 ---
@@ -82,10 +84,16 @@
 
 public static final PropertyDescriptor LINE_SPLIT_COUNT = new 
PropertyDescriptor.Builder()
 .name("Line Split Count")
-.description("The number of lines that will be added to each 
split file")
+.description("The number of lines that will be added to each 
split file, excluding header lines")
 .required(true)
 .addValidator(StandardValidators.POSITIVE_INTEGER_VALIDATOR)
 .build();
+public static final PropertyDescriptor FRAGMENT_MAX_SIZE = new 
PropertyDescriptor.Builder()
+.name("Maximum Fragment Size")
+.description("The maximum size of each split file, including 
header lines")
--- End diff --

Seems like there could be some interesting edge conditions.  For instance 
what if a given line is by itself larger than the max?  Or, does the max get 
exceeded by the size of at most one line then split or is the max exceeded then 
a pushback mechanism goes to the previous line and does the split and then 
continues?  I might be overthinking this but in any case would be good to let 
the user know how these edge cases are handled.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1118 Update SplitText Processor - add supp...

2015-12-01 Thread jskora
GitHub user jskora opened a pull request:

https://github.com/apache/nifi/pull/135

NIFI-1118 Update SplitText Processor - add support for split size lim…

…its and header line markers.

1) Add "Maximum Fragment Size" property.  A new split file will be created 
if the next line to be added to the current split file exceeds this 
user-defined maximum file size.
2) Add "Header Line Marker Character" property.  Lines that begin with 
these user-defined character(s) will be considered header line(s) rather than a 
predetermined number of lines.  The existing property "Header Line Count" must 
be zero for this new property and behavior to be used.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jskora/nifi NIFI-1118

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #135


commit ddd394824e3f36429e5b43b755dd8d9df25f05ab
Author: Joe Skora 
Date:   2015-12-01T18:32:36Z

NIFI-1118 Update SplitText Processor - add support for split size limits 
and header line markers.
1) Add "Maximum Fragment Size" property.  A new split file will be created 
if the next line to be added to the current split file exceeds this 
user-defined maximum file size.
2) Add "Header Line Marker Character" property.  Lines that begin with 
these user-defined character(s) will be considered header line(s) rather than a 
predetermined number of lines.  The existing property "Header Line Count" must 
be zero for this new property and behavior to be used.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---