Wilson Wu created FLINK-29365:
---------------------------------

             Summary: Millisecond behind latest jumps after Flink 1.15.2 upgrade
                 Key: FLINK-29365
                 URL: https://issues.apache.org/jira/browse/FLINK-29365
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kinesis
    Affects Versions: 1.15.2
         Environment: Redeployment from 1.14.4 to 1.15.2
            Reporter: Wilson Wu
         Attachments: Screen Shot 2022-09-19 at 2.50.56 PM.png

(This is my first time filing a ticket in the Flink community; please let me 
know if there are any guidelines I need to follow.)

I noticed very strange behavior after a recent version bump from Flink 1.14.4 
to 1.15.2. My project consumes around 30K records per second from a sharded 
Kinesis stream. During a version upgrade, we follow the best practice of first 
triggering a savepoint from the running job, starting the new job from the 
savepoint, and then removing the old job. This procedure has been tested 
multiple times on 1.14.4 without any issue. Usually, after a version upgrade, 
millisecond behind latest shows a delay of a few minutes, but the job catches 
up quickly (within 30 minutes). Our savepoint is around one hundred MB, and 
our job DAG becomes 90-100% busy with some backpressure when we redeploy, but 
it goes back to normal after 10-20 minutes.

The strange behavior appeared when I redeployed a running 1.14.4 job with the 
1.15.2 upgrade. I can see that a savepoint was created and the new job is 
running, and all the metrics look fine, except that [millisecond behind the 
latest|https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html] 
suddenly jumps to 10 hours, and it takes days for my application to catch up 
with the latest record in the Kinesis stream. I don't understand why it jumps 
from 0 seconds to 10+ hours when we start the new job. The only significant 
change I introduced with the version bump is changing 
[failOnError|https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/connector/kinesis/sink/KinesisStreamsSink.html] 
from true to false, but I don't think this is the root cause.
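
As a rough sanity check on the numbers above, the sketch below estimates how 
many records a 10-hour lag would represent at 30K records per second, and what 
average read rate would be needed to clear it. It assumes the ingest rate stays 
constant and that the reported lag reflects a real backlog. The function names 
and the 48-hour catch-up window are hypothetical (a stand-in for "days"), not 
measured values.

```python
def backlog_records(lag_hours, ingest_per_sec):
    """Records sitting behind the stream tip, assuming a constant ingest rate."""
    return lag_hours * 3600 * ingest_per_sec


def required_read_rate(lag_hours, ingest_per_sec, catch_up_hours):
    """Average read rate needed to clear the backlog within catch_up_hours
    while new records keep arriving at ingest_per_sec."""
    if catch_up_hours <= 0:
        raise ValueError("catch_up_hours must be positive")
    return ingest_per_sec * (1 + lag_hours / catch_up_hours)


if __name__ == "__main__":
    # 10 hours behind at 30K records/s (figures from this report).
    print(backlog_records(10, 30_000))  # 1080000000 records
    # Hypothetical 48-hour catch-up window: roughly 36,250 records/s needed.
    print(round(required_read_rate(10, 30_000, 48)))
```

Under these assumptions, catching up over two days needs only about 20% more 
read throughput than the normal ingest rate, which would be consistent with a 
real backlog of roughly a billion records rather than a metric glitch.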

I also tried redeploying the new 1.15.2 job with a different parallelism. 
Redeploying from a job already on 1.15.2 does not introduce a big delay, so I 
assume the issue above only happens when we bump the version from 1.14.4 to 
1.15.2 (see the attached screenshot). I did the version bump twice and saw the 
same 10+ hour jump in delay both times. We have no changes related to time 
zones.

Please let me know if this can be filed as a bug, given that I do not have a 
running project with the full Kinesis setup available to reproduce the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
