Mark Payne created NIFI-10887:
---------------------------------

             Summary: Improve Performance of ReplaceText processor
                 Key: NIFI-10887
                 URL: https://issues.apache.org/jira/browse/NIFI-10887
             Project: Apache NiFi
          Issue Type: Improvement
            Reporter: Mark Payne
            Assignee: Mark Payne


When performing some tests with the ReplaceText processor, I found that it 
seemed to be quite a bit slower than I expected, especially when using a 
Replacement Strategy of "Literal Replace" and when using a lot of small 
FlowFiles.

As a result, I performed some profiling and identified a few areas that could 
use some improvement:
 * When using the Literal Replace strategy, we find matches using 
{{Pattern.compile(Pattern.quote(...));}} and then using 
{{Pattern.matcher(...).find()}}. This is quite inefficient compared to just 
using {{String.indexOf(...)}} and accounted for approximately 30% of the time 
spent in the processor.
 * A significant amount of time was spent flushing the write buffer, as it 
flushes to disk when finished writing to each individual FlowFile. Even when we 
set a Run Duration > 0 ms, we flush for each FlowFile. This flush() gets 
delegated all the way down to the FileOutputStream. However, when using 
ProcessSession.append(), we intercept this with a NonFlushableOutputStream. We 
should do this when calling ProcessSession.write() as well. While it makes 
sense to flush data from the Processor layer's buffer, there's no need to flush 
past the session layer until the session is committed.
 * A decent bit of time was spent in the session's get() method calling 
{{final Set<FlowFileRecord> set = 
unacknowledgedFlowFiles.computeIfAbsent(connection.getFlowFileQueue(), k -> new 
HashSet<>());}}. The time here was spent in StandardFlowFileQueue's 
hashCode() method, which is the JVM default. We can easily implement hashCode() 
to just return the hashCode of the identifier, which is a String. A String 
caches its hash code after the first computation, so later calls cost little 
more than the method call itself, which eliminates the expense here.
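
To illustrate the first point, here is a minimal sketch of a literal replace 
built on String.indexOf(...) instead of a quoted regex. The class and method 
names are hypothetical and this is not the processor's actual code; it only 
shows the shape of the approach.

```java
// Hypothetical helper for illustration only; not the ReplaceText implementation.
class LiteralReplaceSketch {

    // Replaces every occurrence of 'search' in 'content' with 'replacement',
    // locating matches with String.indexOf rather than a compiled Pattern.
    static String replaceLiteral(String content, String search, String replacement) {
        if (search.isEmpty()) {
            return content; // nothing to search for
        }

        int matchIndex = content.indexOf(search);
        if (matchIndex < 0) {
            return content; // no match, so avoid allocating a new buffer
        }

        StringBuilder sb = new StringBuilder(content.length());
        int from = 0;
        while (matchIndex >= 0) {
            // Copy the text before the match, then the replacement.
            sb.append(content, from, matchIndex).append(replacement);
            from = matchIndex + search.length();
            matchIndex = content.indexOf(search, from);
        }
        // Copy whatever remains after the last match.
        sb.append(content, from, content.length());
        return sb.toString();
    }
}
```

Because the search value is treated as plain text, characters such as "." or 
"$" need no quoting, and the replacement is inserted verbatim as well.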

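For the second point, the idea of intercepting flush() can be sketched with a 
small wrapper stream. NoFlushOutputStream, FlushCountingOutputStream and 
FlushDemo are hypothetical names made up for this example; the real 
NonFlushableOutputStream may well differ in its details.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical wrapper: passes writes through but swallows flush(), so the
// underlying stream is only flushed when its owner decides to do so.
class NoFlushOutputStream extends FilterOutputStream {
    NoFlushOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // bulk write; the inherited version writes byte by byte
    }

    @Override
    public void flush() {
        // Intentionally a no-op.
    }
}

// Stand-in for the stream at the bottom of the stack; counts flush() calls.
class FlushCountingOutputStream extends ByteArrayOutputStream {
    int flushCount = 0;

    @Override
    public void flush() {
        flushCount++;
    }
}

class FlushDemo {
    // Writes the data through the wrapper, flushes, and returns the sink
    // so the caller can inspect what reached it.
    static FlushCountingOutputStream writeAndFlush(byte[] data) {
        FlushCountingOutputStream sink = new FlushCountingOutputStream();
        OutputStream wrapped = new NoFlushOutputStream(sink);
        try {
            wrapped.write(data);
            wrapped.flush(); // does not reach the sink
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink;
    }
}
```

In this sketch the bytes still reach the sink, but sink.flushCount stays at 0 
until whoever owns the sink flushes it explicitly.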

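For the third point, a rough sketch of a hashCode() that delegates to a String 
identifier. QueueSketch and QueueLookupDemo are hypothetical stand-ins, not 
StandardFlowFileQueue itself, and equals() is deliberately left as the default 
identity comparison.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a queue that is used as a map key.
class QueueSketch {
    private final String identifier;

    QueueSketch(String identifier) {
        this.identifier = identifier;
    }

    // Delegate to the identifier. String caches its hash code after the first
    // computation, so repeated map lookups do not recompute it.
    @Override
    public int hashCode() {
        return identifier.hashCode();
    }
}

class QueueLookupDemo {
    private final Map<QueueSketch, Set<String>> unacknowledged = new HashMap<>();

    // Mirrors the computeIfAbsent pattern quoted above: each call hashes the key.
    Set<String> setFor(QueueSketch queue) {
        return unacknowledged.computeIfAbsent(queue, k -> new HashSet<>());
    }
}
```

Leaving equals() untouched keeps the equals/hashCode contract intact, since 
the same instance always reports the same identifier.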

