Mark Payne created NIFI-10887:
---------------------------------
Summary: Improve Performance of ReplaceText processor
Key: NIFI-10887
URL: https://issues.apache.org/jira/browse/NIFI-10887
Project: Apache NiFi
Issue Type: Improvement
Reporter: Mark Payne
Assignee: Mark Payne
When performing some tests with the ReplaceText processor, I found that it
seemed to be quite a bit slower than I expected, especially when using a
Replacement Strategy of "Literal Replace" and when using a lot of small
FlowFiles.
As a result, I performed some profiling and identified a few areas that could
use some improvement:
* When using the Literal Replace strategy, we find matches using
{{Pattern.compile(Pattern.quote(...));}} and then using
{{Pattern.matcher(...).find()}}. This is quite inefficient compared to just
using {{String.indexOf(...)}} and accounted for approximately 30% of the time
spent in the processor.
* A significant amount of time was spent flushing the write buffer, as it
flushes to disk when finished writing to each individual FlowFile. Even when we
set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets
delegated all the way down to the FileOutputStream. However, when using
ProcessSession.append(), we intercept this with a NonFlushableOutputStream. We
should do this when calling ProcessSession.write() as well. While it makes
sense to flush data from the Processor layer's buffer, there's no need to flush
past the session layer until the session is committed.
* A decent bit of time was spent in the session's get() method calling
{{final Set<FlowFileRecord> set =
unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> new
HashSet<>());}}. The time here was spent in StandardFlowFileQueue's
hashCode() method, which is the JVM default. We can easily implement hashCode()
to just return the hashCode of the identifier, which is a String. A String
caches its hash code after the first computation, so later calls cost
essentially nothing beyond the method call itself, which eliminates the
expense here.
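
To make the three ideas above concrete, here is a minimal, hedged Java sketch. It is not the NiFi code or the eventual patch. The names ReplaceTextSketches, literalReplace, literalReplaceWithRegex, NonFlushingOutputStream and QueueSketch are hypothetical stand-ins invented for illustration, and the regex variant only approximates the find()-based matching described in the first bullet.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these types exist in NiFi.
final class ReplaceTextSketches {

    // Bullet 1: literal replacement driven by String.indexOf, with no regex
    // compilation and no Matcher state.
    static String literalReplace(String text, String search, String replacement) {
        if (search.isEmpty()) {
            return text; // nothing to search for; also avoids an endless loop
        }
        int matchStart = text.indexOf(search);
        if (matchStart < 0) {
            return text; // fast path: no match, so no copying at all
        }
        StringBuilder out = new StringBuilder(text.length());
        int copyFrom = 0;
        while (matchStart >= 0) {
            // copy the unmatched gap, then the replacement
            out.append(text, copyFrom, matchStart).append(replacement);
            copyFrom = matchStart + search.length();
            matchStart = text.indexOf(search, copyFrom);
        }
        return out.append(text, copyFrom, text.length()).toString();
    }

    // Regex-based counterpart for comparison: quotes the search value and
    // compiles a Pattern on every call.
    static String literalReplaceWithRegex(String text, String search, String replacement) {
        Matcher matcher = Pattern.compile(Pattern.quote(search)).matcher(text);
        return matcher.replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Bullet 2: a wrapper that swallows flush() so it never reaches the
    // wrapped stream; whoever owns that stream flushes it once, later.
    static class NonFlushingOutputStream extends FilterOutputStream {
        NonFlushingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len); // skip FilterOutputStream's byte-at-a-time default
        }

        @Override
        public void flush() {
            // intentionally a no-op
        }
    }

    // Bullet 3: hashCode() delegated to a String identifier. equals() is left
    // as identity here, which still satisfies the equals/hashCode contract.
    static final class QueueSketch {
        private final String identifier;

        QueueSketch(String identifier) {
            this.identifier = identifier;
        }

        @Override
        public int hashCode() {
            return identifier.hashCode(); // String caches this after first use
        }
    }

    public static void main(String[] args) {
        System.out.println(literalReplace("a.b.a", ".", "-")); // prints "a-b-a"
        System.out.println(new QueueSketch("queue-1").hashCode() == "queue-1".hashCode()); // prints "true"
    }
}
```

Both replacement methods should return the same result for any non-empty search value, so timing them against each other on many small inputs is one way to reproduce the kind of difference described in the first bullet.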
--
This message was sent by Atlassian Jira
(v8.20.10#820010)