[
https://issues.apache.org/jira/browse/NIFI-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640190#comment-17640190
]
Mark Payne commented on NIFI-10887:
-----------------------------------
Benchmarks show significant performance improvements.
Attached are screenshots of performance before and after the changes for both
Regex Replace and Literal Replace. In both cases, I used 4 concurrent tasks
and a 25 ms Run Duration.
With Regex Replace, performance was slightly better, as expected.
With Literal Replace, the improvement was more pronounced.
> Improve Performance of ReplaceText processor
> --------------------------------------------
>
> Key: NIFI-10887
> URL: https://issues.apache.org/jira/browse/NIFI-10887
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Labels: performance
> Fix For: 1.20.0
>
> Attachments: ReplaceText-LiteralReplace-AfterChanges.png,
> ReplaceText-LiteralReplace-BeforeChanges.png,
> ReplaceText-RegexReplace-AfterChanges.png,
> ReplaceText-RegexReplace-BeforeChanges.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When performing some tests with the ReplaceText processor, I found that it
> seemed to be quite a bit slower than I expected, especially when using a
> Replacement Strategy of "Literal Replace" and when using a lot of small
> FlowFiles.
> As a result, I performed some profiling and identified a few areas that could
> use some improvement:
> * When using the Literal Replace strategy, we find matches using
> {{Pattern.compile(Pattern.quote(...));}} and then using
> {{Pattern.matcher(...).find()}}. This is quite inefficient compared to
> just using {{String.indexOf(...)}} and accounted for approximately 30% of the
> time spent in the processor.
> * A significant amount of time was spent flushing the write buffer, as it
> flushes to disk when finished writing to each individual FlowFile. Even when
> we set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets
> delegated all the way down to the FileOutputStream. However, when using
> ProcessSession.append(), we intercept this with a NonFlushableOutputStream.
> We should do this when calling ProcessSession.write() as well. While it makes
> sense to flush data from the Processor layer's buffer, there's no need to
> flush past the session layer until the session is committed.
> * A decent bit of time was spent in the session's get() method calling
> {{final Set<FlowFileRecord> set =
> unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k ->
> new HashSet<>());}}. The time here was spent in StandardFlowFileQueue's
> hashCode() method, which is the JVM default. We can easily implement
> hashCode() to just return the hashCode of the identifier, which is a String.
> A String caches its hash code after the first computation, so the call is
> effectively free (apart from the method call itself), which eliminates the
> expense here.
> * When using a Run Duration > 0 ms, we can hold InputStreams open by
> processing multiple FlowFiles in a given Session. This can also significantly
> improve performance. As such, we should make the default run duration 25 ms
> instead of 0 ms.
> * A common pattern with ReplaceText is to prepend text to the beginning of a
> FlowFile or line, and then use another ReplaceText to append text to the end
> of the FlowFile or line. We should have a "Surround" strategy that allows
> us to both prepend and append text. This will double the performance for
> this use case.
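
The first bullet above (replacing a quoted regex with String.indexOf) can be
sketched roughly as follows. This is an illustration only, not the code from
the actual change: the class name LiteralReplaceSketch and both method names
are hypothetical, and it works on in-memory Strings rather than FlowFile
content streams. Both methods are meant to produce the same result for a
non-empty search term; the second simply avoids building and running a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names are not taken from the NiFi code base.
public class LiteralReplaceSketch {

    // Regex-based approach: quote the search term so it matches literally,
    // then walk the matches with Matcher.find().
    static String replaceWithQuotedPattern(String text, String search, String replacement) {
        if (search.isEmpty()) {
            return text; // assumption: an empty search term is treated as a no-op
        }
        Matcher matcher = Pattern.compile(Pattern.quote(search)).matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start()).append(replacement);
            last = matcher.end();
        }
        return out.append(text, last, text.length()).toString();
    }

    // indexOf-based approach: same result, without any regex machinery.
    static String replaceWithIndexOf(String text, String search, String replacement) {
        if (search.isEmpty()) {
            return text; // assumption: an empty search term is treated as a no-op
        }
        int idx = text.indexOf(search);
        if (idx < 0) {
            return text; // no match, so nothing needs to be copied
        }
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (idx >= 0) {
            out.append(text, last, idx).append(replacement);
            last = idx + search.length();
            idx = text.indexOf(search, last);
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

For example, replaceWithIndexOf("a.b.a", ".", "-") treats the dot as a plain
character, just as the quoted pattern does. Any speedup from this would need
to be measured in the processor itself; the 30% figure above comes from the
profiling described in the issue, not from this sketch.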