[
https://issues.apache.org/jira/browse/FLINK-21596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301683#comment-17301683
]
Dawid Wysakowicz edited comment on FLINK-21596 at 3/15/21, 2:56 PM:
--------------------------------------------------------------------
I think the problem is that the timeout is too aggressive for Azure. The
TaskManager cannot start in time to take a checkpoint and perform the test. You
can see in the failed logs that the job switches to {{RUNNING}} very close to
the actual timeout:
{code}
*01:14:14,988 [Source: Custom Source -> Sink: Unnamed (1/1)#0] INFO org.apache.flink.runtime.taskmanager.Task [] - Source: Custom Source -> Sink: Unnamed (1/1)#0 (8bfcc29007eca8cef72ff256cd6ff37a) switched from DEPLOYING to RUNNING.
01:14:15,027 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Sink: Unnamed (1/1) (8bfcc29007eca8cef72ff256cd6ff37a) switched from DEPLOYING to RUNNING.*
01:14:15,186 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1614820455048 for job 470600686ea2d957e6a81620d925566e.
01:14:15,858 [ main] ERROR org.apache.flink.test.checkpointing.CheckpointFailureManagerITCase [] -
--------------------------------------------------------------------------------
Test testAsyncCheckpointFailureTriggerJobFailed(org.apache.flink.test.checkpointing.CheckpointFailureManagerITCase) failed with:
org.junit.runners.model.TestTimedOutException: test timed out after 10000 milliseconds
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.flink.test.util.TestUtils.submitJobAndWaitForResult(TestUtils.java:62)
    at org.apache.flink.test.checkpointing.CheckpointFailureManagerITCase.testAsyncCheckpointFailureTriggerJobFailed(CheckpointFailureManagerITCase.java:103)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:748)
{code}
I will increase the timeout for the test and slightly increase the
checkpointing interval.
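To make "very close to the actual timeout" concrete, the margin can be worked out from the three timestamps in the log excerpt. The following is only a rough sketch: the class and method names ({{TimeoutMargin}}, {{millisBetween}}) are made up for illustration, and it assumes the ERROR line was logged at roughly the moment the 10000 ms JUnit timeout fired.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Flink: computes gaps between log timestamps.
public class TimeoutMargin {
    // The log timestamps use HH:mm:ss,SSS (a comma before the milliseconds).
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    /** Milliseconds elapsed between two log timestamps on the same day. */
    static long millisBetween(String from, String to) {
        return Duration.between(
                LocalTime.parse(from, LOG_TIME),
                LocalTime.parse(to, LOG_TIME)).toMillis();
    }

    public static void main(String[] args) {
        String running = "01:14:14,988";   // Task switched from DEPLOYING to RUNNING
        String triggered = "01:14:15,186"; // CheckpointCoordinator triggered checkpoint 1
        String timedOut = "01:14:15,858";  // TestTimedOutException logged (assumed ~ timeout)

        // Time the job had in RUNNING state before the test was aborted.
        System.out.println("RUNNING -> timeout: "
                + millisBetween(running, timedOut) + " ms");
        // Time the first checkpoint had to fail the job before the test was aborted.
        System.out.println("trigger -> timeout: "
                + millisBetween(triggered, timedOut) + " ms");
    }
}
```

Under that assumption the job had 870 ms in {{RUNNING}} and the first checkpoint had 672 ms before the test was aborted, so most of the 10000 ms budget appears to have been spent on startup.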
was (Author: dawidwys):
I will increase the timeout for the test and slightly decrease the
checkpointing interval.
> CheckpointFailureManagerITCase.testAsyncCheckpointFailureTriggerJobFailed
> fail
> -------------------------------------------------------------------------------
>
> Key: FLINK-21596
> URL: https://issues.apache.org/jira/browse/FLINK-21596
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Guowei Ma
> Assignee: Dawid Wysakowicz
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14079&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=c2734c79-73b6-521c-e85a-67c7ecae9107
> {code:java}
> [ERROR] testAsyncCheckpointFailureTriggerJobFailed(org.apache.flink.test.checkpointing.CheckpointFailureManagerITCase) Time elapsed: 38.623 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 10000 milliseconds
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at org.apache.flink.test.util.TestUtils.submitJobAndWaitForResult(TestUtils.java:62)
>     at org.apache.flink.test.checkpointing.CheckpointFailureManagerITCase.testAsyncCheckpointFailureTriggerJobFailed(CheckpointFailureManagerITCase.java:103)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>     at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.lang.Thread.run(Thread.java:748)
> {code}