Danny Cranmer created FLINK-23802:
-------------------------------------
Summary: [kinesis][efo] Reduce ReadTimeoutExceptions for Kinesis
Consumer
Key: FLINK-23802
URL: https://issues.apache.org/jira/browse/FLINK-23802
Project: Flink
Issue Type: Improvement
Components: Connectors / Kinesis
Affects Versions: 1.13.2, 1.12.5, 1.12.4, 1.13.1, 1.12.3, 1.13.0, 1.12.2,
1.12.1, 1.12.0
Reporter: Danny Cranmer
Assignee: Danny Cranmer
Fix For: 1.14.0, 1.12.6, 1.13.3
h3. Background
The Kinesis EFO consumer uses an async AWS SDK Netty client to read records
from Kinesis. When the client is inactive for 30 seconds a
{{ReadTimeoutException}} is thrown by Netty. The consumer will terminate the
subscription, backoff and retry. Jobs with high backpressure can result in
frequent {{ReadTImeoutException}} and the frequent backoff and retry can cause
unnecessary overhead.
h3. What?
Reduce/eliminate {{ReadTimeoutException}} from the EFO consumer
h3. How?
There are 2 improvements to be made:
1. Request next record from the Flink source thread rather than the AWS SDK
response thread. This means that there will always be space in the input buffer
queue. The AWS SDK async response thread is no longer blocking on this queue.
Backpressure is now applied by the Flink source thread rather than the AWS SDK
thread.
2. Increase the Read Timeout (30s) to be higher than the maximum Shard
subscription duration (5m) and enable TCP keep alive
h3. References
This has already been implemented and tested in
[amazon-kinesis-connector-flink|https://github.com/awslabs/amazon-kinesis-connector-flink]:
- [Prevent SDK threads
blocking|https://github.com/awslabs/amazon-kinesis-connector-flink/pull/40]
- [Increase read timeout and enable TCP
keepalive|https://github.com/awslabs/amazon-kinesis-connector-flink/pull/42]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)