[
https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17675661#comment-17675661
]
ASF subversion and git services commented on NIFI-10887:
--------------------------------------------------------
Commit d5c79fdcd1c806f6376c101f392e333c4a86b805 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=d5c79fdcd1 ]
NIFI-10887: Addressed performance concerns. Use String.indexOf() instead of
Pattern.matcher() when using Literal Replace. Use a NonFlushableOutputStream
when ProcessSession.write() is called. Implemented hashCode() on
AbstractConnection. Updated default Run Schedule on ReplaceText from 0 ms to 25
ms. Added a Surround Replacement strategy that allows both prepending and
appending text. Updated unit tests to account for this.
Signed-off-by: Matthew Burgess <[email protected]>
This closes #6724
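
To illustrate the NonFlushableOutputStream idea named in the commit message, here is a minimal sketch of a wrapper whose flush() does nothing, so that per-FlowFile flushes never reach the underlying file stream. The class name NonFlushingStreamSketch is hypothetical and this is not NiFi's actual implementation; it only assumes the standard java.io API.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration only; not the class shipped in NiFi.
class NonFlushingStreamSketch extends FilterOutputStream {

    NonFlushingStreamSketch(final OutputStream out) {
        super(out);
    }

    @Override
    public void write(final byte[] b, final int off, final int len) throws IOException {
        // Delegate in bulk; FilterOutputStream's default writes one byte at a time.
        out.write(b, off, len);
    }

    @Override
    public void flush() {
        // Intentionally a no-op: the owner of the wrapped stream decides
        // when data is really flushed (for example, at session commit).
    }

    @Override
    public void close() throws IOException {
        // Close the wrapped stream directly, without an implicit flush call here.
        out.close();
    }
}
```

A caller can still invoke flush() after every write without paying for a flush on the wrapped stream each time.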
> Improve Performance of ReplaceText processor
> --------------------------------------------
>
> Key: NIFI-10887
> URL: https://issues.apache.org/jira/browse/NIFI-10887
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Labels: performance
> Fix For: 1.20.0
>
> Attachments: ReplaceText-LiteralReplace-0msRunDuration.png,
> ReplaceText-LiteralReplace-25msRunDuration.png,
> ReplaceText-LiteralReplace-AfterChanges.png,
> ReplaceText-LiteralReplace-BeforeChanges.png,
> ReplaceText-RegexReplace-AfterChanges.png,
> ReplaceText-RegexReplace-BeforeChanges.png
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> When performing some tests with the ReplaceText processor, I found that it
> seemed to be quite a bit slower than I expected, especially when using a
> Replacement Strategy of "Literal Replace" and when using a lot of small
> FlowFiles.
> As a result, I performed some profiling and identified a few areas that could
> use some improvement:
> * When using the Literal Replace strategy, we find matches using
> {{Pattern.compile(Pattern.quote(...));}} and then calling
> {{Pattern.matcher(...).find()}}. This is quite inefficient compared to
> just using {{String.indexOf(...)}} and accounted for approximately 30% of the
> time spent in the processor.
> * A significant amount of time was spent flushing the write buffer, as it
> flushes to disk when finished writing to each individual FlowFile. Even when
> we set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets
> delegated all the way down to the FileOutputStream. However, when using
> ProcessSession.append(), we intercept this with a NonFlushableOutputStream.
> We should do this when calling ProcessSession.write() as well. While it makes
> sense to flush data from the Processor layer's buffer, there's no need to
> flush past the session layer until the session is committed.
> * A decent bit of time was spent in the session's get() method calling
> {{final Set<FlowFileRecord> set =
> unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k ->
> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's
> hashCode() method, which is the JVM default. We can easily implement
> hashCode() to just return the hashCode of the identifier, which is a String.
> A String caches its hash code after the first computation, so the call is
> effectively constant time (apart from the method call itself), which
> eliminates the expense here.
> * When using a Run Duration > 0 ms, we can hold InputStreams open by
> processing multiple FlowFiles in a given Session. This can also significantly
> improve performance. As such, we should make the default run duration 25 ms
> instead of 0 ms.
> * A common pattern with ReplaceText is to prepend text to the beginning of a
> FlowFile or line, and then use another ReplaceText to append text to the end
> of the FlowFile or line. We should have a "Surround" strategy that allows
> us to both prepend text and append text. This will result in double the
> performance for this use case.
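
As a rough illustration of the first point in the list above, literal replacement can be done with String.indexOf() alone, without compiling a Pattern. This is a minimal sketch, not the code from the commit; the class and method names (LiteralReplaceSketch, replaceLiteral) are hypothetical.

```java
// Hypothetical illustration only; not the ReplaceText implementation.
class LiteralReplaceSketch {

    // Replaces every non-overlapping occurrence of 'search' with 'replacement',
    // treating both as plain text (no regex metacharacters, no group references).
    static String replaceLiteral(final String input, final String search, final String replacement) {
        if (search.isEmpty()) {
            return input; // nothing meaningful to search for
        }

        int matchIndex = input.indexOf(search);
        if (matchIndex < 0) {
            return input; // fast path: no match, no copying
        }

        final StringBuilder sb = new StringBuilder(input.length());
        int start = 0;
        while (matchIndex >= 0) {
            sb.append(input, start, matchIndex); // text before the match
            sb.append(replacement);
            start = matchIndex + search.length();
            matchIndex = input.indexOf(search, start);
        }
        sb.append(input, start, input.length()); // remaining tail
        return sb.toString();
    }
}
```

Because the search value is never interpreted as a regular expression, there is no need for Pattern.quote() and no Matcher is created for each FlowFile.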