[
https://issues.apache.org/jira/browse/FLINK-28398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18067497#comment-18067497
]
Souptik Barman commented on FLINK-28398:
----------------------------------------
[~martijnvisser] can this be assigned to me? I am not sure whether this issue is
tracked anywhere else, since the last comment is almost 4 years old.
I can consistently reproduce the issue on my local system, on the latest master
branch, by setting MINIMAL_CHECKPOINT_TIME = 1L, just as [~chesnay] mentioned in
the comment above.
From debugging, it seems that the canceller scheduled with the timeout can run
first and abort the checkpoint before the TestingMasterHook trigger, which would
release triggerCheckpointLatch, gets executed. The latch is then never
triggered, so the test thread stays stuck in await.
I added whenComplete on the first checkpoint future and called the trigger
function of triggerCheckpointLatch there, so that the await can exit. With that
change the test passes even with MINIMAL_CHECKPOINT_TIME = 1L.
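The race and the workaround can be sketched with plain JDK classes. This is only
an illustration under assumptions, not the actual test code: CountDownLatch
stands in for OneShotLatch, a bare CompletableFuture stands in for the first
checkpoint future, and LatchReleaseSketch, awaitReleased and
releaseOnCompletion are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the test scenario; not Flink code.
final class LatchReleaseSketch {

    /**
     * Simulates a checkpoint that is aborted by its timeout canceller before the
     * master hook (which would normally release the latch) ever runs.
     * Returns true if the waiting thread was released within waitMillis.
     */
    static boolean awaitReleased(boolean releaseOnCompletion, long waitMillis) {
        CountDownLatch triggerCheckpointLatch = new CountDownLatch(1);
        CompletableFuture<Long> checkpointFuture = new CompletableFuture<>();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            if (releaseOnCompletion) {
                // Workaround: release the latch whenever the future completes,
                // whether normally or exceptionally.
                checkpointFuture.whenComplete(
                        (id, failure) -> triggerCheckpointLatch.countDown());
            }
            // Canceller: aborts the checkpoint after 1 ms. The hook never runs,
            // so nothing else counts the latch down.
            timer.schedule(
                    () -> checkpointFuture.completeExceptionally(
                            new RuntimeException("checkpoint expired")),
                    1,
                    TimeUnit.MILLISECONDS);
            // Bounded wait so the sketch terminates; an unbounded await would hang.
            return triggerCheckpointLatch.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("released without whenComplete: " + awaitReleased(false, 200));
        System.out.println("released with whenComplete: " + awaitReleased(true, 2000));
    }
}
```

Without the whenComplete callback the bounded await times out, which corresponds
to the unbounded await in the test hanging. With the callback, the latch is
released as soon as the canceller completes the future exceptionally.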
I only changed test code, so it should not affect any other flow. The test does
not appear to change any global state, so I hope it will not break anything
else. This is my first contribution, so I may be lacking some context.
> CheckpointCoordinatorTriggeringTest.discardingTriggeringCheckpointWillExecuteNextCheckpointRequest(
> gets stuck
> --------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28398
> URL: https://issues.apache.org/jira/browse/FLINK-28398
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 2.1.0
> Reporter: Martijn Visser
> Priority: Major
> Labels: stale-assigned, test-stability
> Fix For: 2.3.0
>
>
> {code:java}
> Jul 01 02:16:55 "main" #1 prio=5 os_prio=0 tid=0x00007fe41000b800 nid=0x5ca2 in Object.wait() [0x00007fe41a429000]
> Jul 01 02:16:55 java.lang.Thread.State: WAITING (on object monitor)
> Jul 01 02:16:55 at java.lang.Object.wait(Native Method)
> Jul 01 02:16:55 at java.lang.Object.wait(Object.java:502)
> Jul 01 02:16:55 at org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:61)
> Jul 01 02:16:55 - locked <0x00000000f096ab58> (a java.lang.Object)
> Jul 01 02:16:55 at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorTriggeringTest.discardingTriggeringCheckpointWillExecuteNextCheckpointRequest(CheckpointCoordinatorTriggeringTest.java:731)
> Jul 01 02:16:55 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Jul 01 02:16:55 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jul 01 02:16:55 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jul 01 02:16:55 at java.lang.reflect.Method.invoke(Method.java:498)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37433&view=logs&j=d89de3df-4600-5585-dadc-9bbc9a5e661c&t=be5a4b15-4b23-56b1-7582-795f58a645a2&l=15207