[
https://issues.apache.org/jira/browse/NIFI-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652020#comment-16652020
]
Peter Wicks commented on NIFI-5689:
-----------------------------------
[~ottobackwards] and I talked about this a lot. Here is a unit test
demonstrating the issue. It fails on current master because the buffer size is
limited so that the \r is isolated at the end of the buffer. Without the buffer
restriction, this unit test returns 'abc.txt abc.txt abc.txt', each on its
own line. With the restricted buffer size, we get 4 copies of the text
because the \r is treated as one line ending and the following \n as a second.
{code:java}
@Test
public void testWindowsLineEndingsAcrossBufferLineByLine() {
    final TestRunner runner = getRunner();
    runner.setProperty(ReplaceText.EVALUATION_MODE, ReplaceText.LINE_BY_LINE);
    runner.setProperty(ReplaceText.REPLACEMENT_STRATEGY, ReplaceText.ALWAYS_REPLACE);
    runner.setProperty(ReplaceText.SEARCH_VALUE, "i do not exist anywhere in the text");
    runner.setProperty(ReplaceText.REPLACEMENT_VALUE, "${filename}");
    // "Hello\nWorld!\r" is exactly 13 bytes, so the \r is the last byte of the first buffer.
    runner.setProperty(ReplaceText.MAX_BUFFER_SIZE, "13 B");

    final Map<String, String> attributes = new HashMap<>();
    attributes.put("filename", "abc.txt");
    runner.enqueue("Hello\nWorld!\r\ntoday!\n".getBytes(), attributes);
    runner.run();

    runner.assertAllFlowFilesTransferred(ReplaceText.REL_SUCCESS, 1);
    final MockFlowFile out = runner.getFlowFilesForRelationship(ReplaceText.REL_SUCCESS).get(0);
    // Expect exactly three replaced lines, each keeping its original line ending.
    out.assertContentEquals("abc.txt\nabc.txt\r\nabc.txt\n");
}
{code}
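For anyone who wants to see the boundary effect without a NiFi checkout, here is a minimal standalone sketch. It is not the ReplaceText code: the class CrLfBoundarySketch and both split methods are hypothetical names made up for illustration, and the fixed-size chunks only stand in for the processor's buffer. The naive version looks for the \n that follows a \r only inside the current chunk, while the second version carries a "pending CR" flag across chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not NiFi's implementation.
class CrLfBoundarySketch {

    // Naive splitter: each chunk is scanned in isolation, so a CR that ends one
    // chunk and the LF that starts the next are counted as two line endings.
    static List<String> splitNaive(String text, int chunkSize) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int start = 0; start < text.length(); start += chunkSize) {
            String chunk = text.substring(start, Math.min(text.length(), start + chunkSize));
            for (int i = 0; i < chunk.length(); i++) {
                char c = chunk.charAt(i);
                current.append(c);
                if (c == '\r' && i + 1 < chunk.length() && chunk.charAt(i + 1) == '\n') {
                    // CR-LF fully inside this chunk: keep the pair together.
                    current.append('\n');
                    i++;
                }
                if (c == '\n' || c == '\r') {
                    lines.add(current.toString());
                    current.setLength(0);
                }
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    // Boundary-aware splitter: a CR does not end the line immediately. The
    // pendingCr flag survives chunk boundaries, so a following LF can join it.
    static List<String> splitBoundaryAware(String text, int chunkSize) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean pendingCr = false;
        for (int start = 0; start < text.length(); start += chunkSize) {
            String chunk = text.substring(start, Math.min(text.length(), start + chunkSize));
            for (int i = 0; i < chunk.length(); i++) {
                char c = chunk.charAt(i);
                if (pendingCr && c != '\n') {
                    // The earlier CR was a complete line ending on its own.
                    lines.add(current.toString());
                    current.setLength(0);
                }
                pendingCr = false;
                current.append(c);
                if (c == '\n') {
                    lines.add(current.toString());
                    current.setLength(0);
                } else if (c == '\r') {
                    pendingCr = true;
                }
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "Hello\nWorld!\r\ntoday!\n";
        System.out.println(splitNaive(input, 13).size());         // prints 4
        System.out.println(splitNaive(input, 1024).size());       // prints 3
        System.out.println(splitBoundaryAware(input, 13).size()); // prints 3
    }
}
```

With a chunk size of 13 the naive splitter reports four lines for the same input the unit test uses, and three once the chunk is large enough to hold the whole text, which mirrors the behaviour described above.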
> ReplaceText does not handle end of line correctly on buffer boundary
> --------------------------------------------------------------------
>
> Key: NIFI-5689
> URL: https://issues.apache.org/jira/browse/NIFI-5689
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 1.7.1
> Reporter: Sergei Zhirikov
> Priority: Minor
> Attachments: Text_Parsing_Bug.xml
>
>
> ReplaceText appears to misbehave under the following conditions:
> * The input flow file contains text with Windows-style line endings (CR-LF).
> * ReplaceText is configured to perform "Regex Replace" in "Line-by-Line" mode.
> * The "Maximum Buffer Size" is set to a value smaller than the whole file content, but large enough to fit any of the text lines in the file.
> * A CR-LF pair of characters in one of the lines happens to be split across two buffers, that is, CR is the last character in one buffer and LF is the first one in the following one.
>
> An example flow template is attached to illustrate the problem.
> In the example, the regular expression is intended to remove white space at
> the end of each line. It operates as expected on all lines except the third
> one (containing "GHI"). That line satisfies the conditions described above.
> As a result, the CR character at the end of that line is removed, which does
> not happen on the other lines.
> In some more complicated cases both CR and LF end up being removed,
> effectively joining two lines into one, although I haven't
> managed to create a simple test case to reproduce that.