Danny Cranmer created FLINK-23802: ------------------------------------- Summary: [kinesis][efo] Reduce ReadTimeoutExceptions for Kinesis Consumer Key: FLINK-23802 URL: https://issues.apache.org/jira/browse/FLINK-23802 Project: Flink Issue Type: Improvement Components: Connectors / Kinesis Affects Versions: 1.13.2, 1.12.5, 1.12.4, 1.13.1, 1.12.3, 1.13.0, 1.12.2, 1.12.1, 1.12.0 Reporter: Danny Cranmer Assignee: Danny Cranmer Fix For: 1.14.0, 1.12.6, 1.13.3
h3. Background The Kinesis EFO consumer uses an async AWS SDK Netty client to read records from Kinesis. When the client is inactive for 30 seconds a {{ReadTimeoutException}} is thrown by Netty. The consumer will terminate the subscription, backoff and retry. Jobs with high backpressure can result in frequent {{ReadTImeoutException}} and the frequent backoff and retry can cause unnecessary overhead. h3. What? Reduce/eliminate {{ReadTimeoutException}} from the EFO consumer h3. How? There are 2 improvements to be made: 1. Request next record from the Flink source thread rather than the AWS SDK response thread. This means that there will always be space in the input buffer queue. The AWS SDK async response thread is no longer blocking on this queue. Backpressure is now applied by the Flink source thread rather than the AWS SDK thread. 2. Increase the Read Timeout (30s) to be higher than the maximum Shard subscription duration (5m) and enable TCP keep alive h3. References This has already been implemented and tested in [amazon-kinesis-connector-flink|https://github.com/awslabs/amazon-kinesis-connector-flink]: - [Prevent SDK threads blocking|https://github.com/awslabs/amazon-kinesis-connector-flink/pull/40] - [Increase read timeout and enable TCP keepalive|https://github.com/awslabs/amazon-kinesis-connector-flink/pull/42] -- This message was sent by Atlassian Jira (v8.3.4#803005)