Danny Cranmer created FLINK-23802:
-------------------------------------

             Summary: [kinesis][efo] Reduce ReadTimeoutExceptions for Kinesis Consumer
                 Key: FLINK-23802
                 URL: https://issues.apache.org/jira/browse/FLINK-23802
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / Kinesis
    Affects Versions: 1.13.2, 1.12.5, 1.12.4, 1.13.1, 1.12.3, 1.13.0, 1.12.2, 1.12.1, 1.12.0
            Reporter: Danny Cranmer
            Assignee: Danny Cranmer
             Fix For: 1.14.0, 1.12.6, 1.13.3


h3. Background

The Kinesis EFO consumer uses an asynchronous AWS SDK Netty client to read 
records from Kinesis. When the client is inactive for 30 seconds, Netty throws 
a {{ReadTimeoutException}}. The consumer then terminates the subscription, 
backs off and retries. Jobs with high backpressure can trigger frequent 
{{ReadTimeoutException}}s, and the repeated backoff and retry cycles cause 
unnecessary overhead.

h3. What?

Reduce or eliminate {{ReadTimeoutException}}s from the EFO consumer.

h3. How?
  
There are two improvements to be made:
1. Request the next record from the Flink source thread rather than the AWS 
SDK response thread. This means that there will always be space in the input 
buffer queue, so the AWS SDK async response thread no longer blocks on this 
queue. Backpressure is instead applied by the Flink source thread.
2. Increase the read timeout (currently 30s) to be higher than the maximum 
shard subscription duration (5m), and enable TCP keep-alive.
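
The first improvement can be illustrated with a minimal sketch built on the JDK's {{java.util.concurrent.Flow}} API rather than the AWS SDK. All names here ({{SourceThreadDemandSketch}}, {{ShardSubscriber}}, {{nextRecord}}, {{readAll}}) are hypothetical and are not the connector's actual classes. The assumption is that the response thread only enqueues records, while the reading (source) thread requests the next record after it has taken one from the buffer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Flow;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of source-thread-driven demand; not the connector's code. */
final class SourceThreadDemandSketch {

    /** Subscriber whose onNext never blocks the publisher's (response) thread. */
    static final class ShardSubscriber implements Flow.Subscriber<String> {
        private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        private volatile Flow.Subscription subscription;
        private volatile boolean done;

        @Override
        public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1); // initial demand: exactly one record in flight
        }

        @Override
        public void onNext(String record) {
            // Demand is only issued after the reader has taken a record,
            // so there is always room and this call does not block.
            buffer.add(record);
        }

        @Override
        public void onError(Throwable t) {
            done = true;
        }

        @Override
        public void onComplete() {
            done = true;
        }

        /** Called from the source thread; backpressure is applied here. */
        String nextRecord(long timeoutMs) throws InterruptedException {
            String record = buffer.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (record != null) {
                subscription.request(1); // ask for the next record only now
            }
            return record;
        }

        boolean isFinished() {
            // Read 'done' first: once it is set, no further records arrive.
            return done && buffer.isEmpty();
        }
    }

    /** Publishes the given records and reads them back on the calling thread. */
    static List<String> readAll(List<String> records) throws InterruptedException {
        ShardSubscriber subscriber = new ShardSubscriber();
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            records.forEach(publisher::submit);
        } // close() signals onComplete once pending records are delivered

        List<String> out = new ArrayList<>();
        while (true) {
            String record = subscriber.nextRecord(50);
            if (record != null) {
                out.add(record);
            } else if (subscriber.isFinished()) {
                return out;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(readAll(List.of("r1", "r2", "r3")));
    }
}
```

In this sketch a slow reader simply delays the next {{request(1)}} call, so the thread delivering records is never parked on a full queue. The second improvement is client configuration and is not shown here.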

h3. References

This has already been implemented and tested in 
[amazon-kinesis-connector-flink|https://github.com/awslabs/amazon-kinesis-connector-flink]:
- [Prevent SDK threads 
blocking|https://github.com/awslabs/amazon-kinesis-connector-flink/pull/40]
- [Increase read timeout and enable TCP 
keepalive|https://github.com/awslabs/amazon-kinesis-connector-flink/pull/42]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
