Github user apiri commented on the pull request:
https://github.com/apache/nifi/pull/213#issuecomment-194975737
@mans2singh I had a chance to sit down and revisit this. Overall, it looks
good and I was able to test a flow successfully putting to a Kinesis Firehose
which aggregated and dumped to S3.
One thing mentioned in my initial review that we still need to cover is how we are handling batching. I do think we need to handle that in a more constrained fashion, given that file sizes could vary widely.
With how the processor is currently configured, it could hold up to 250MB in
memory, by default. Instead, what are your thoughts on converting this to a
buffer size property? If people want batching, they can specify a given
memory size (perhaps something like 1 MB by default) and then we can wait until
that threshold is hit or no more input flowfiles are available, at which point
they are sent off in a batch. If batching is not desired, they can either
empty the buffer property or specify 0 bytes.
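To make the proposal concrete, here is a minimal sketch of the size-bounded buffering described above. This is not actual NiFi processor code; the class and method names (`SizeBoundedBatcher`, `offer`, `flush`) are hypothetical, and flowfiles are represented by their payload sizes only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed buffer-size batching, not NiFi API code.
// Flowfiles are represented by their payload sizes in bytes.
public class SizeBoundedBatcher {
    private final long maxBufferBytes;           // e.g. 1 MB by default, as proposed
    private final List<Long> buffer = new ArrayList<>();
    private long bufferedBytes = 0;

    public SizeBoundedBatcher(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    // Offer one flowfile's size; returns a batch to send once the
    // threshold is reached, otherwise null.
    public List<Long> offer(long flowFileBytes) {
        buffer.add(flowFileBytes);
        bufferedBytes += flowFileBytes;
        // A threshold of 0 disables batching: every flowfile is sent alone.
        if (maxBufferBytes == 0 || bufferedBytes >= maxBufferBytes) {
            return flush();
        }
        return null;
    }

    // Called when no more input flowfiles are available, so a partial
    // batch is still sent rather than held in memory.
    public List<Long> flush() {
        List<Long> batch = new ArrayList<>(buffer);
        buffer.clear();
        bufferedBytes = 0;
        return batch;
    }
}
```

The key property is that heap usage is bounded by the configured buffer size regardless of how many flowfiles arrive, which is the constraint the current 250MB default does not provide.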
Thoughts on this approach? Ultimately, we are trying to prevent people from
inadvertently causing heap exhaustion. With the approach prescribed here,
people can be as aggressive as they wish with batching while still having a
finitely constrained amount of space per instance.