[
https://issues.apache.org/jira/browse/FLINK-39003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emre Kartoglu updated FLINK-39003:
----------------------------------
Description:
The `millisBehindLatest` metric emitted by the Kinesis source connector
```
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kinesis</artifactId>
    <version>5.0.0-1.20</version>
</dependency>
```
appears to be inaccurate on app failover.
Attached are 2 screenshots from AWS CloudWatch, taken around the same timestamp,
when the app (running on Amazon MSF) was restarting continuously. One screenshot
shows the `millisBehindLatest` emitted by the DynamoDB (DDB) connector, and the
other shows the same metric from the Kinesis connector. The metric from the DDB
connector looks fairly accurate: within 1-2 minutes of outage, it reported 81
seconds of latency. The metric from the Kinesis connector, however, shows ~3854
seconds of latency within 1-2 minutes of continuous app restarts, even though
the same metric was 0 or close to 0 right before the restarts.
The issue may well be caused by the AWS managed service, but the accurate DDB
metrics in the same system suggest that this might be a Kinesis connector bug.
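To make the expectation concrete, here is a minimal sketch of the sanity check implied above. `LagSanityCheck` and its methods are hypothetical names used only for illustration and are not part of the connector. It assumes the source resumes from roughly the position it had reached before the outage (rather than from a much older checkpoint), so the reported lag should not exceed the lag before the outage plus the outage duration.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of flink-connector-kinesis. */
public final class LagSanityCheck {

    private LagSanityCheck() {}

    /**
     * Upper bound on the lag after an outage, assuming nothing was read during
     * the outage and reading resumes from the pre-outage position.
     */
    public static Duration maxPlausibleLag(Duration lagBefore, Duration outage) {
        return lagBefore.plus(outage);
    }

    /** True if the reported lag does not exceed the bound above. */
    public static boolean isPlausible(Duration lagBefore, Duration outage, Duration reported) {
        return reported.compareTo(maxPlausibleLag(lagBefore, outage)) <= 0;
    }

    public static void main(String[] args) {
        Duration lagBefore = Duration.ZERO;      // metric was ~0 right before the restarts
        Duration outage = Duration.ofMinutes(2); // upper end of the 1-2 minute window

        // DDB connector: 81 s reported
        System.out.println(isPlausible(lagBefore, outage, Duration.ofSeconds(81)));   // prints "true"
        // Kinesis connector: ~3854 s reported
        System.out.println(isPlausible(lagBefore, outage, Duration.ofSeconds(3854))); // prints "false"
    }
}
```

Under these assumptions the DDB value fits within the bound, while the Kinesis value exceeds it by roughly an hour.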
!Screenshot 2026-01-30 at 12.40.35.png!
was:
The `millisBehindLatest` metric emitted by the Kinesis source connector
```
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kinesis</artifactId>
<version>5.0.0-1.2</version>
</dependency>
```
appears to be inaccurate on app failover.
Attached are 2 screenshots from AWS Cloudwatch around the same timestamp when
the app (running on Amazon MSF) had continuous restarts. One of the screenshots
shows the `millisBehindLatest` emitted by the DynamoDB (DDB) connector, and the
other one shows that from the Kinesis connector. The metrics from the DDB
connector look fairly accurate, i.e. within 1-2 minutes of outage, we had 81
seconds of latency. However the metric from the Kinesis connector shows ~3854
seconds of latency within 1-2 minutes of app continuous restarts. When the same
metric was 0 or close to 0 right before the restarts.
The issue may well be caused by the AWS managed service, but the accurate DDB
metrics in the same system suggests a possibility that this might be a Kinesis
connector bug.
!Screenshot 2026-01-30 at 12.40.35.png!
> Inaccurate millisBehindLatest metric from the Kinesis source connector on
> failover
> ----------------------------------------------------------------------------------
>
> Key: FLINK-39003
> URL: https://issues.apache.org/jira/browse/FLINK-39003
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Kinesis
> Affects Versions: aws-connector-5.0.0
> Reporter: Emre Kartoglu
> Assignee: Emre Kartoglu
> Priority: Major
> Labels: AWS
> Attachments: Screenshot 2026-01-30 at 12.40.35.png, Screenshot
> 2026-01-30 at 12.47.00.png
>
>
> The `millisBehindLatest` metric emitted by the Kinesis source connector
>
> ```
> <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-connector-kinesis</artifactId>
>     <version>5.0.0-1.20</version>
> </dependency>
> ```
> appears to be inaccurate on app failover.
> Attached are 2 screenshots from AWS CloudWatch, taken around the same
> timestamp, when the app (running on Amazon MSF) was restarting continuously.
> One screenshot shows the `millisBehindLatest` emitted by the DynamoDB (DDB)
> connector, and the other shows the same metric from the Kinesis connector.
> The metric from the DDB connector looks fairly accurate: within 1-2 minutes
> of outage, it reported 81 seconds of latency. The metric from the Kinesis
> connector, however, shows ~3854 seconds of latency within 1-2 minutes of
> continuous app restarts, even though the same metric was 0 or close to 0
> right before the restarts.
> The issue may well be caused by the AWS managed service, but the accurate DDB
> metrics in the same system suggest that this might be a Kinesis connector
> bug.
>
> !Screenshot 2026-01-30 at 12.40.35.png!