[jira] [Created] (FLINK-29365) Millisecond behind latest jumps after Flink 1.15.2 upgrade

Wilson Wu (Jira) Tue, 20 Sep 2022 12:54:23 -0700

Wilson Wu created FLINK-29365:
---------------------------------

             Summary: Millisecond behind latest jumps after Flink 1.15.2 upgrade
                 Key: FLINK-29365
                 URL: https://issues.apache.org/jira/browse/FLINK-29365
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kinesis
    Affects Versions: 1.15.2
         Environment: Redeployment from 1.14.4 to 1.15.2
            Reporter: Wilson Wu
         Attachments: Screen Shot 2022-09-19 at 2.50.56 PM.png

(First time filling a ticket in Flink community, please let me know if there
are any guidelines I need to follow)

I noticed a very strange behavior with a recent version bump from Flink 1.14.4
to 1.15.2. My project consumes around 30K records per second from a sharded
kinesis stream, and during the version upgrade, it will follow the best
practice to first trigger a savepoint from the running job, start the new job
from the savepoint and then remove the old job. So far so good, and the above
logic has been tested multiple times without any issue for 1.14.4. Usually,
after the version upgrade, our job will have a few minutes delay for
millisecond behind latest, but it will catch up with the speed quickly(within
30mins). Our savepoint is around one hundred MBs big, and our job DAG will
become 90 - 100% busy with some backpressure when we redeploy but after 10-20
minutes it goes back to normal.

Then the strange thing happened, when I tried to redeploy with 1.15.2 upgrade
from a running 1.14.4 job, I can see a savepoint has been created and the new
job is running, all the metrics look fine, except suddenly [millisecond behind
the
latest|https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html]
jumps to 10 hours!! and it takes days for my application to catch up with the
kinesis stream latest record. I don't understand why it jumps from 0 second to
10+ hours when we restart the new job. The only main change I introduced with
version bump is to change
[failOnError|https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/connector/kinesis/sink/KinesisStreamsSink.html]
from true to false, but I don't think this is the root cause.

I tried to redeploy the new 1.15.2 job by changing our parallelism, redeploying
a job from 1.15.2 does not introduce a big delay, so I assume the issue above
only happens when we bump version from 1.14.4 to 1.15.2(note the attached
screenshot)? I did try to bump it twice and I see the same 10hrs+ jump in
delay, we do not have changes related to any timezones.

Please let me know if this can be filled as a bug, as I do not have a running
project with all the kinesis setup available that can reproduce the issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-29365) Millisecond behind latest jumps after Flink 1.15.2 upgrade

Reply via email to