[jira] [Created] (FLINK-26657) Resilient Kinesis consumption

John Karp (Jira) Tue, 15 Mar 2022 08:23:06 -0700

John Karp created FLINK-26657:
---------------------------------

             Summary: Resilient Kinesis consumption
                 Key: FLINK-26657
                 URL: https://issues.apache.org/jira/browse/FLINK-26657
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / Kinesis
            Reporter: John Karp



Currently, any sort of error reading from a flink stream will quickly result in 
a job-killing error. If the error is not 'recoverable', failure will be 
instant, or if it is 'recoverable', there will be a fixed number of retries 
before the job fails -- and for some operations such as GetRecords, the retries 
can be exhausted in just a few seconds. Furthermore, 
KinesisProxy.isRecoverableSdkClientException() and 
KinesisProxy.isRecoverableException() only recognize very narrow categories of 
errors as even being recoverable.

So for example if a Flink job is aggregating Kinesis streams from multiple 
regions, the Flink job will not be able to make any forward progress on 
processing data from any region if there is a single-region outage, since the 
job will likely fail before any checkpoint can be completed. For some use 
cases, it is better to proceed with processing most of the data, than to wait 
indefinitely for the problematic region to recover.

One mitigation is to increase all of the ConsumerConfig timeouts to be very 
high. However, this will only affect error handling for 'recoverable' 
exceptions, and depending on the nature of the regional failure, the resulting 
errors may not be classified as 'recoverable'.

Proposed mitigation: add a 'soft failure' mode to the Kinesis consumer, where 
most errors arising from Kinesis operations are considered recoverable, and 
there are unlimited retries. (Except for perhaps EFO de-registration, which I'm 
assuming needs to complete in a timely fashion. Also, it looks like 
ExpiredIteratorException needs to bubble up to 
PollingRecordPublisher.getRecords() without retries.)




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Created] (FLINK-26657) Resilient Kinesis consumption

Reply via email to