[
https://issues.apache.org/jira/browse/NIFI-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634519#comment-17634519
]
ASF subversion and git services commented on NIFI-10817:
--------------------------------------------------------
Commit bda624823b5252ac6f309e1631dcb23b62864f99 in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=bda624823b ]
NIFI-10817: Moved the calls to 'resetState()' and 'acknowledgeRecords()' in
StandardProcessSession from the outer commit(boolean) to the inner
commit(Checkpoint, boolean). Moving the calls there leaves the logic of
StandardProcessSession unaffected, but the StatelessProcessSession that
inherits from it now gets its state cleaned up when calling
super.commit(Checkpoint, boolean).
Signed-off-by: Matthew Burgess <[email protected]>
This closes #6663
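The refactoring pattern described in the commit can be sketched as follows. This is a minimal, hypothetical illustration, not NiFi's actual code: the class bodies, the Checkpoint stub, and the String-based records list are simplified stand-ins; only the method names resetState(), acknowledgeRecords(), commit(boolean), and commit(Checkpoint, boolean) come from the commit message. The point is that cleanup placed in the inner commit runs on every path that reaches it, including a subclass delegating via super.

```java
import java.util.ArrayList;
import java.util.List;

class StandardSession {
    // stand-in for the session's 'records' member that holds intermediate claims
    final List<String> records = new ArrayList<>();

    // outer commit: before the fix, cleanup lived here, so a subclass
    // that reached only the inner commit never released its state
    public void commit(boolean async) {
        commit(new Checkpoint(), async);
    }

    // inner commit: cleanup moved here, so every caller that reaches it
    // (including super calls from subclasses) releases the held records
    protected void commit(Checkpoint checkpoint, boolean async) {
        // ... persist the checkpoint ...
        acknowledgeRecords();
        resetState();
    }

    void acknowledgeRecords() { /* acknowledge processed records */ }

    void resetState() {
        records.clear(); // drop references so content claims become collectible
    }

    static class Checkpoint { }
}

class StatelessSession extends StandardSession {
    @Override
    protected void commit(Checkpoint checkpoint, boolean async) {
        super.commit(checkpoint, async); // state is now cleaned up here, too
        // ... stateless-specific bookkeeping ...
    }
}
```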
> Stateless NiFi does not release FlowFile content until flow completes
> ---------------------------------------------------------------------
>
> Key: NIFI-10817
> URL: https://issues.apache.org/jira/browse/NIFI-10817
> Project: Apache NiFi
> Issue Type: Bug
> Components: NiFi Stateless
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> When a stateless flow is run, the content that is stored in the FlowFile
> repository is not cleaned up until the flow completes.
> This means that if we have the following flow:
> ConsumeKafka -> ReplaceText -> MergeContent (1000 FlowFile bucket) ->
> MergeContent (1000 FlowFile bucket) -> PutS3
> The intent here would be to pull data, transform it, merge together many
> records, and put the result to S3. The expectation is that we'd have no more
> than 1,000,000 Kafka messages in the content repo at a time, though with two
> copies of each (1,000 FlowFiles, each containing 1,000 Kafka messages,
> waiting to be merged, PLUS the merged result).
> However, what we see is that we have the final merged content, plus the 1,000
> bundles ahead of it still in the repo (expected), PLUS the 1,000,000
> individual transformed messages PLUS the original 1,000,000 messages. These
> intermediate FlowFiles' contents should be purged as aggressively as they can
> be. This is particularly important when using an in-memory content repository.
> The in-memory content repository does not actually store the content within
> the repo; rather, it provides a mechanism by which the content is written to
> the claim held by the FlowFileRecord. Then, when the claim is no longer
> referenced, we rely on garbage collection to clean it up. However, it appears
> that the ProcessSession is holding on to all of these intermediate claims in
> its {{records}} member variable, and we can purge those much more
> aggressively.
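The mechanism above can be sketched in a few lines. Everything here is hypothetical and simplified (ContentClaim holding a raw byte[], a String-keyed map, the write/purge method names); it only illustrates the idea from the description: with an in-memory repo, the bytes live on the claim object, so as long as the session's records map holds a strong reference to a claim, the garbage collector cannot reclaim its content, and dropping the reference eagerly is what makes the content collectible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not NiFi's actual API: with an in-memory content
// repository, the content bytes live on the claim itself, so the claim's
// reachability decides when the content can be garbage-collected.
class ContentClaim {
    final byte[] content;
    ContentClaim(byte[] content) { this.content = content; }
}

class Session {
    // stand-in for the session's 'records' member holding intermediate claims
    final Map<String, ContentClaim> records = new HashMap<>();

    void write(String flowFileId, byte[] content) {
        records.put(flowFileId, new ContentClaim(content));
    }

    // Eager purge: drop the claim as soon as the intermediate FlowFile is
    // superseded, rather than waiting for the whole stateless flow to finish.
    void purge(String flowFileId) {
        records.remove(flowFileId); // last strong reference gone -> GC-eligible
    }
}
```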
--
This message was sent by Atlassian Jira
(v8.20.10#820010)