Mark Payne created NIFI-10817:
---------------------------------

             Summary: Stateless NiFi does not release FlowFile content until flow completes
                 Key: NIFI-10817
                 URL: https://issues.apache.org/jira/browse/NIFI-10817
             Project: Apache NiFi
          Issue Type: Bug
          Components: NiFi Stateless
            Reporter: Mark Payne
            Assignee: Mark Payne


When a stateless flow is run, the FlowFile content that is stored in the content 
repository is not cleaned up until the flow completes.

For example, consider the following flow:

ConsumeKafka -> ReplaceText -> MergeContent (1000 FlowFile bucket) -> 
MergeContent (1000 FlowFile bucket) -> PutS3

The intent here is to pull data, transform it, merge many records together, 
and put the result to S3. The expectation is that the content repo would hold 
no more than 1,000,000 Kafka messages at a time, with two copies of each: 
1,000 FlowFiles, each containing 1,000 Kafka messages, waiting to be merged, 
PLUS the merged result.

However, what we see is that we have the final merged content, plus the 1,000 
bundles ahead of it still in the repo (expected), PLUS the 1,000,000 individual 
transformed messages PLUS the original 1,000,000 messages. These intermediate 
FlowFiles' contents should be purged as aggressively as they can be. This is 
particularly important when using an in-memory content repository.
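
The copy counts described above can be tallied with a short sketch. This is 
only back-of-the-envelope arithmetic, not NiFi code: the function name 
copies_held and the dictionary labels are made up for illustration, and the 
bucket size of 1,000 is taken from the flow shown earlier.

```python
def copies_held(bucket: int) -> dict:
    """Count message copies in the content repo for the two-stage merge flow.

    Hypothetical helper for illustration; it only does the arithmetic.
    """
    # Each first-stage bundle holds `bucket` messages, and the second
    # MergeContent combines `bucket` bundles into the final FlowFile.
    messages = bucket * bucket

    # Expected: the bundles waiting to be merged plus the merged result.
    expected = {
        "bundles awaiting second merge": messages,
        "final merged result": messages,
    }

    # Observed: the same, plus both generations of per-message FlowFiles.
    observed = dict(expected)
    observed["transformed messages"] = messages
    observed["original messages"] = messages

    return {
        "expected": sum(expected.values()),
        "observed": sum(observed.values()),
    }


if __name__ == "__main__":
    print(copies_held(1_000))
```

With a bucket size of 1,000 this gives 2,000,000 expected message copies 
against 4,000,000 observed, so the content held is roughly double what the 
flow needs.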

The in-memory content repository does not actually store the content within the 
repo but rather facilitates a mechanism by which the content can be written to 
the claim held by the FlowFileRecord. Then, when no longer referenced, we rely 
on garbage collection to clean up. However, it appears that the ProcessSession 
is holding on to all of these intermediate claims in its {{records}} member 
variable, and we can purge those much more aggressively.
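
The effect of holding every intermediate record can be shown with a toy 
model. It is a sketch under stated assumptions, not the actual ProcessSession 
implementation: ToySession, purge_eagerly and peak_records are hypothetical 
names, a plain dict stands in for the {{records}} member variable, and 
deleting a dict entry stands in for dropping the last reference so that 
garbage collection can reclaim the content. The ReplaceText step is left out 
to keep the model small.

```python
class ToySession:
    """Hypothetical stand-in for a session that tracks FlowFile content."""

    def __init__(self, purge_eagerly: bool):
        self.records = {}          # FlowFile id -> content bytes
        self.purge_eagerly = purge_eagerly
        self.peak = 0              # most records held at any one time

    def _put(self, ff_id: str, data: bytes) -> str:
        self.records[ff_id] = data
        self.peak = max(self.peak, len(self.records))
        return ff_id

    def create(self, ff_id: str, data: bytes) -> str:
        return self._put(ff_id, data)

    def merge(self, ids: list, new_id: str) -> str:
        merged = b"".join(self.records[i] for i in ids)
        self._put(new_id, merged)
        if self.purge_eagerly:
            # The inputs are no longer needed once the merged copy exists.
            for i in ids:
                del self.records[i]
        return new_id


def peak_records(purge_eagerly: bool, bucket: int = 10) -> int:
    """Run a two-stage merge and return the peak number of records held."""
    session = ToySession(purge_eagerly)
    bundles = []
    for b in range(bucket):
        ids = [session.create(f"msg-{b}-{m}", b"x") for m in range(bucket)]
        bundles.append(session.merge(ids, f"bundle-{b}"))
    session.merge(bundles, "final")
    return session.peak


if __name__ == "__main__":
    print("retain until completion:", peak_records(False))
    print("purge eagerly:", peak_records(True))
```

In this model, retaining everything makes the peak grow with the square of 
the bucket size, while purging merged inputs keeps it proportional to the 
bucket size.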



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
