GitHub user apiri commented on the pull request:
https://github.com/apache/nifi/pull/213#issuecomment-195367460
@mans2singh I think your logic for collecting the flowfiles should probably
just grab individual files continuously, one at a time. This continues until
one of two scenarios is reached: the max buffer size is met or exceeded (based
on the size of each flowfile collected so far), or there are no more files
coming into the processor (one or more flowfiles have been collected but we
have not yet reached the max buffer size). In either case, we would close out
that collected batch of files and send them on their way, much as you had before.
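The loop described above could be sketched roughly like this. This is plain Java with a stand-in flowfile type, not the NiFi API; `collectBatch`, `FlowFile`, and the size-only record are illustrative names I am assuming for the example:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class BufferBatcher {
    // Stand-in for a flowfile; only its size matters for this sketch.
    public record FlowFile(long size) {}

    // Grab files one at a time until the max buffer size is met or exceeded,
    // or until no more files are available (keeping whatever was collected).
    public static List<FlowFile> collectBatch(Queue<FlowFile> incoming, long maxBufferBytes) {
        List<FlowFile> batch = new ArrayList<>();
        long buffered = 0;
        FlowFile next;
        while ((next = incoming.poll()) != null) {
            batch.add(next);
            buffered += next.size();
            if (buffered >= maxBufferBytes) {
                break; // max buffer met or exceeded: close out this batch
            }
        }
        return batch; // send these on their way as one batch
    }

    public static void main(String[] args) {
        Queue<FlowFile> queue = new ArrayDeque<>();
        for (int i = 0; i < 5; i++) queue.add(new FlowFile(1_000_000)); // five 1MB files
        System.out.println(collectBatch(queue, 3_000_000).size()); // prints 3
    }
}
```

Either exit condition (buffer reached, or queue drained) ends the loop the same way: the collected batch is closed out and transferred.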
In terms of the 250MB default, this comes from the default batch size of 250
where, in the worst-case scenario, each file is 1MB in size. While each of
these files is converted to a byte array for sending, they are continuously
sitting on the heap. In the event multiple instances of this processor are
running, we could quickly consume large chunks of the heap. What I am
proposing is to either get rid of the batch size (that property no longer
exists) or make it a secondary consideration: we try to receive a certain
batch size, but first ensure we do not exceed the configured buffer and,
second, that we do not exceed the batch size, with similar semantics to the
scenarios above in terms of when those batches are sent on their way, plus
sending when the batch size has been reached.
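That secondary-consideration ordering might look like the following variant, where the buffer limit is checked first and the batch-size target second. Again a sketch under assumptions: the names are hypothetical, and I am treating the batch size as a file count:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class DualLimitBatcher {
    // Stand-in for a flowfile; only its size matters for this sketch.
    public record FlowFile(long size) {}

    // Collect until EITHER the configured buffer is met or exceeded OR the
    // batch size (a file count here) is reached, whichever comes first.
    public static List<FlowFile> collectBatch(Queue<FlowFile> incoming,
                                              long maxBufferBytes, int batchSize) {
        List<FlowFile> batch = new ArrayList<>();
        long buffered = 0;
        FlowFile next;
        while (batch.size() < batchSize && (next = incoming.poll()) != null) {
            batch.add(next);
            buffered += next.size();
            if (buffered >= maxBufferBytes) {
                break; // the buffer limit wins over the batch-size target
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        Queue<FlowFile> queue = new ArrayDeque<>();
        for (int i = 0; i < 250; i++) queue.add(new FlowFile(1_000_000)); // worst case: 1MB each
        // A 50MB buffer caps the batch well below the 250-file target.
        System.out.println(collectBatch(queue, 50_000_000, 250).size()); // prints 50
    }
}
```

The main method illustrates the heap concern: with the buffer check in place, a run of worst-case 1MB files is cut off at the buffer limit rather than growing to the full 250-file batch.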
Does that clarify a bit?