[ 
https://issues.apache.org/jira/browse/FLINK-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17609254#comment-17609254
 ] 

Hong Liang Teoh commented on FLINK-29365:
-----------------------------------------

Hi [~wilsonwu],

Thanks for reporting this. Indeed, we don't expect millisBehindLatest to jump 
after a version upgrade.

I've tried replicating this issue by upgrading my Kinesis consumer application 
from 1.14.4 to 1.15.2, but was unable to reproduce it. These are the test 
scenarios I tried:
 * tested both Polling and EFO
 * tested consuming streams without resharding as well as streams with 
multiple reshardings (a mix of shards with records and without records)
 * tested scaling the job up from parallelism 2 to parallelism 5 at the same 
time as the upgrade
 * started the consumer from both TRIM_HORIZON and AT_TIMESTAMP (this 
shouldn't matter when restoring from a snapshot anyway)

In all these cases, there was no spike in the client-side millisBehindLatest or 
the server-side IteratorAgeMilliseconds.

 

However, in your case, it seems you were able to reliably replicate the spike 
in millisBehindLatest. There is likely some difference in our setups or Kinesis 
streams that triggers the issue in your case. To help root-cause it, would you 
be able to provide the following?
 * Taskmanager logs from the 1.15.2 -> 1.15.2 redeploy (no spike) and the 
1.14.4 -> 1.15.2 upgrade (with spike). We can then compare and contrast the 
restored state for each shard.
 * Kinesis streams have [Enhanced 
metrics|https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html#kinesis-metrics-shard].
 If you enable them, we can see IteratorAgeMilliseconds for each individual 
shard, and identify which shardId is seeing the spike in millisBehindLatest.
 * If possible, it would be great to have the output of the describe-stream 
call (i.e. {{aws kinesis describe-stream --stream-name <stream_name>}}), as 
this will help us determine the parent-child relations between shards.
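As a minimal sketch of the last point: the {{ShardId}}, {{ParentShardId}}, and {{AdjacentParentShardId}} fields in the DescribeStream response are enough to reconstruct the shard lineage after a split or merge. The sample JSON below is made up for illustration; only the field names follow the real Kinesis API.

```python
import json

# Hypothetical, trimmed sample of `aws kinesis describe-stream` output.
# Field names (ShardId, ParentShardId, AdjacentParentShardId) are real
# DescribeStream fields; the shard IDs themselves are invented.
SAMPLE = """
{
  "StreamDescription": {
    "StreamName": "example-stream",
    "Shards": [
      {"ShardId": "shardId-000000000000"},
      {"ShardId": "shardId-000000000001"},
      {"ShardId": "shardId-000000000002",
       "ParentShardId": "shardId-000000000000",
       "AdjacentParentShardId": "shardId-000000000001"}
    ]
  }
}
"""

def shard_lineage(description: dict) -> dict:
    """Map each parent ShardId to the list of its child ShardIds."""
    lineage = {}
    for shard in description["StreamDescription"]["Shards"]:
        # A child of a split has one parent; a child of a merge has two
        # (ParentShardId plus AdjacentParentShardId).
        for key in ("ParentShardId", "AdjacentParentShardId"):
            parent = shard.get(key)
            if parent:
                lineage.setdefault(parent, []).append(shard["ShardId"])
    return lineage

if __name__ == "__main__":
    print(shard_lineage(json.loads(SAMPLE)))
```

In this sample, shards 0 and 1 were merged into shard 2, so both map to the same child; a stream with many such generations is where a restored consumer could end up pointed at an old parent shard.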

 

 

> Millisecond behind latest jumps after Flink 1.15.2 upgrade
> ----------------------------------------------------------
>
>                 Key: FLINK-29365
>                 URL: https://issues.apache.org/jira/browse/FLINK-29365
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis
>    Affects Versions: 1.15.2
>         Environment: Redeployment from 1.14.4 to 1.15.2
>            Reporter: Wilson Wu
>            Priority: Major
>         Attachments: Screen Shot 2022-09-19 at 2.50.56 PM.png
>
>
> (First time filing a ticket in the Flink community, please let me know if 
> there are any guidelines I need to follow.)
> I noticed very strange behavior with a recent version bump from Flink 
> 1.14.4 to 1.15.2. My project consumes around 30K records per second from a 
> sharded Kinesis stream, and during a version upgrade we follow the best 
> practice of first triggering a savepoint from the running job, starting the 
> new job from that savepoint, and then removing the old job. So far so good, 
> and this logic has been tested multiple times without any issue on 1.14.4. 
> Usually, after a version upgrade, our job has a few minutes of delay in 
> millisecond behind latest, but it catches up quickly (within 30 minutes). 
> Our savepoint is around one hundred MB, and our job DAG becomes 90-100% 
> busy with some backpressure when we redeploy, but after 10-20 minutes it 
> goes back to normal.
> Then the strange thing happened: when I tried to redeploy with the 1.15.2 
> upgrade from a running 1.14.4 job, I can see that a savepoint was created 
> and the new job is running, and all the metrics look fine, except that 
> suddenly [millisecond behind the 
> latest|https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html]
>  jumps to 10 hours!! It takes days for my application to catch up with the 
> latest record in the Kinesis stream. I don't understand why it jumps from 0 
> seconds to 10+ hours when we restart the new job. The only main change I 
> introduced with the version bump is changing 
> [failOnError|https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/connector/kinesis/sink/KinesisStreamsSink.html]
>  from true to false, but I don't think this is the root cause.
> I tried to redeploy the new 1.15.2 job with a changed parallelism; 
> redeploying a job from 1.15.2 does not introduce a big delay, so I assume 
> the issue above only happens when we bump the version from 1.14.4 to 1.15.2 
> (note the attached screenshot). I did bump it twice and saw the same 10hr+ 
> jump in delay both times; we do not have any changes related to timezones.
> Please let me know if this can be filed as a bug, as I do not have a running 
> project with all the Kinesis setup available that can reproduce the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
