Mark Payne created NIFI-15669:
---------------------------------
Summary: Refactor ConsumeKinesis to avoid dependency on KCL
Key: NIFI-15669
URL: https://issues.apache.org/jira/browse/NIFI-15669
Project: Apache NiFi
Issue Type: Improvement
Components: Extensions
Reporter: Mark Payne
Assignee: Mark Payne
The existing ConsumeKinesis Processor depends on the Kinesis Client Library,
which is a reasonable choice, in general. However, the Kinesis Client Library
is very heavy-weight, brings in a lot of transitive dependencies and was
designed with some assumptions in mind that do not align well with use in NiFi.
Specifically, startup time is very long - typically 10-15 minutes for the first
iteration as it waits for all consumers to join before allowing any consumer to
pull data. While this makes sense for a big, long-running job that you might
run from Spark or Flink, it is too long to be reasonable for NiFi and often
leads to users thinking it's not working.
More importantly, the KCL library, when using Enhanced Fan-Out Consumer,
buffers up to 11 Kinesis Events per shard. This limit of 11 events is hard
coded and cannot be configured. An Event is effectively a batch of records,
which typically amounts to 1-2 MB. For a few shards and/or where NiFi is
keeping up without issue, this works well. However, if backpressure gets
applied we end up buffering 2 MB * 11 = 22 MB or so per shard. Many users have
Kinesis Streams with hundreds of shards, some even with 1,000+ shards. With
1,000 shards that's 22 GB of data buffered in heap. And this is considering
only the 'payload'. There is additional overhead from metadata and Java objects.
We need a faster, more stable solution. The new solution can still make use of
DynamoDB for checkpointing and support more or less the same options and
configuration in a backward compatible manner.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)