[
https://issues.apache.org/jira/browse/NIFI-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115409#comment-15115409
]
Mark Payne commented on NIFI-1008:
----------------------------------
Looking into this a bit more and investigating the usage patterns of the
Processors that really need this to work well, I think the best solution is to
provide a level of indirection at the FlowFile record. Rather than having
StandardProcessSession always return a StandardFlowFileRecord, it should
return a 'proxy' object that wraps the StandardFlowFileRecord. If a
threshold is reached for the number of 'active flow files' in the session, the
session is free to swap out the StandardFlowFileRecord for a
'SwappedFlowFile' and update the proxy to point to this new 'SwappedFlowFile'.
Any later access would simply swap the standard record back in. This
allows us to swap out FlowFiles even if a Processor is holding onto the
FlowFile, as would be the case with SplitText.
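
To make the idea concrete, here is a rough, self-contained sketch of the
indirection described above. It is not the actual NiFi code: FlowFileProxy,
InMemoryRecord, Session and the in-memory swapStore map (standing in for disk)
are hypothetical names chosen for illustration, and a null target plays the
role of the 'SwappedFlowFile'. The oldest-first eviction order is also just an
assumption.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class SwapProxySketch {

    /** Full record held on the heap (stand-in for StandardFlowFileRecord). */
    static final class InMemoryRecord {
        final long id;
        final Map<String, String> attributes;

        InMemoryRecord(long id, Map<String, String> attributes) {
            this.id = id;
            this.attributes = attributes;
        }
    }

    /** Handle given to the Processor; its target can be swapped out behind it. */
    static final class FlowFileProxy {
        private final long id;
        private final Session session;
        private InMemoryRecord target; // null means "swapped out"

        FlowFileProxy(long id, Session session, InMemoryRecord target) {
            this.id = id;
            this.session = session;
            this.target = target;
        }

        /** Any access goes through the session, which swaps in if needed. */
        String getAttribute(String key) {
            return session.resolve(this).attributes.get(key);
        }

        boolean isSwappedOut() {
            return target == null;
        }
    }

    /** Hypothetical session that caps the number of in-memory records. */
    static final class Session {
        private final int threshold;
        // A map stands in for the on-disk swap file in this sketch.
        private final Map<Long, Map<String, String>> swapStore = new HashMap<>();
        private final Deque<FlowFileProxy> active = new ArrayDeque<>();
        private long nextId = 0;

        Session(int threshold) {
            if (threshold < 1) {
                throw new IllegalArgumentException("threshold must be >= 1");
            }
            this.threshold = threshold;
        }

        FlowFileProxy create(Map<String, String> attributes) {
            long id = nextId++;
            InMemoryRecord record = new InMemoryRecord(id, new HashMap<>(attributes));
            FlowFileProxy proxy = new FlowFileProxy(id, this, record);
            track(proxy);
            return proxy;
        }

        /** Returns the in-memory record, swapping it back in first if necessary. */
        InMemoryRecord resolve(FlowFileProxy proxy) {
            if (proxy.target == null) {
                proxy.target = new InMemoryRecord(proxy.id, swapStore.remove(proxy.id));
                track(proxy);
            }
            return proxy.target;
        }

        /** Registers a proxy as active and swaps out the oldest ones over the threshold. */
        private void track(FlowFileProxy proxy) {
            active.addLast(proxy);
            while (active.size() > threshold) {
                FlowFileProxy oldest = active.pollFirst();
                swapStore.put(oldest.id, oldest.target.attributes);
                oldest.target = null; // the Processor's handle stays valid
            }
        }

        int activeCount() { return active.size(); }

        int swappedCount() { return swapStore.size(); }
    }

    public static void main(String[] args) {
        Session session = new Session(2);
        FlowFileProxy a = session.create(Collections.singletonMap("name", "a"));
        FlowFileProxy b = session.create(Collections.singletonMap("name", "b"));
        FlowFileProxy c = session.create(Collections.singletonMap("name", "c"));
        System.out.println("a swapped out: " + a.isSwappedOut());
        System.out.println("a name: " + a.getAttribute("name")); // swaps a back in
        System.out.println("b swapped out: " + b.isSwappedOut());
        System.out.println("c name: " + c.getAttribute("name"));
    }
}
```

The point of the sketch is that the Processor only ever holds the proxy, so
the session can drop the heavyweight record at any time without invalidating
the reference the Processor is holding.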
I have made a lot of progress on this front, but it's a significant change to
some very important pieces of code, so I don't want to try to get this into the
0.5.0 release. I think it's better to wait for 0.6.0 so that there is plenty of
time to test everything diligently.
> NiFi should swap out FlowFiles to disk even before the session is committed
> ---------------------------------------------------------------------------
>
> Key: NIFI-1008
> URL: https://issues.apache.org/jira/browse/NIFI-1008
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Reporter: Mark Payne
> Assignee: Mark Payne
> Fix For: 0.6.0
>
>
> Currently, NiFi will swap out FlowFiles if there are a large number in a
> FlowFile Queue. This is done to avoid running out of JVM heap space. However,
> if we have a simple flow like GetFile -> SplitText and GetFile pulls in a
> large file, SplitText can quickly cause OutOfMemoryError. This is not because
> it buffers the content of the FlowFile in memory but rather because it holds
> the millions of FlowFile objects in memory. We can do better.
> When we call session.transfer for the FlowFiles, once we hit a magical
> threshold (say 10,000), we should swap those FlowFiles to disk and the
> session should transfer them to the queue as "swapped out" FlowFiles, rather
> than buffering all of them in memory and swapping them out only once
> they land in the queue.