[jira] [Commented] (FLINK-11631) TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on Travis

Biao Liu (JIRA) Wed, 17 Jul 2019 00:18:19 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886741#comment-16886741
 ]


Biao Liu commented on FLINK-11631:
----------------------------------

Basically [~till.rohrmann] is right: the root cause is that {{TaskExecutor}} 
does not properly shut down running tasks. This test case shuts down a 
{{TaskExecutor}} directly, without any cancellation, and currently 
{{TaskExecutor}} does not cancel running tasks when it shuts down. The result 
is an exception saying that not all buffers were returned to the {{BufferPool}}.

The confusing part of this case is that the exception is thrown from 
{{teardown}}. The failure actually happens in 
{{miniCluster.terminateTaskExecutor(0);}}.
However, the test does not check the termination {{CompletableFuture}}, and 
{{MiniCluster}} does not remove the terminated {{TaskExecutor}}.
In {{teardown}}, {{MiniCluster}} shuts down all {{TaskExecutors}} and 
checks their termination {{CompletableFutures}}. This check fails because 
the failed future of the already terminated {{TaskExecutor}} is still kept 
by {{MiniCluster}}.
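
The effect described above can be reproduced in isolation. The following is 
only a minimal sketch using plain {{CompletableFuture}}s; {{ToyCluster}} and 
its methods are hypothetical stand-ins and not Flink's actual {{MiniCluster}} 
API. It shows how a failed termination future that is never checked and never 
removed resurfaces when all futures are awaited at close time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for MiniCluster; NOT Flink's real class.
class ToyCluster {
    // Termination futures of every component ever started,
    // including components that were already terminated.
    private final List<CompletableFuture<Void>> terminationFutures = new ArrayList<>();

    // Starts a component and returns its index.
    int startComponent() {
        terminationFutures.add(new CompletableFuture<>());
        return terminationFutures.size() - 1;
    }

    // Terminates one component; a non-null error models a failed shutdown.
    // The future stays in the list, which is the problematic part.
    CompletableFuture<Void> terminate(int index, Throwable error) {
        CompletableFuture<Void> future = terminationFutures.get(index);
        if (error == null) {
            future.complete(null);
        } else {
            future.completeExceptionally(error);
        }
        return future;
    }

    // Models teardown: stop everything, then wait on ALL termination futures.
    void close() {
        for (CompletableFuture<Void> future : terminationFutures) {
            future.complete(null); // no effect on futures that are already completed
        }
        // join() throws CompletionException if any future failed earlier.
        CompletableFuture.allOf(terminationFutures.toArray(new CompletableFuture[0])).join();
    }
}
```

If a caller ignores the future returned by {{terminate}} and later calls 
{{close}}, the old failure is reported from {{close}}, which mirrors the 
exception appearing in {{teardown}} rather than at the termination call.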

Before https://issues.apache.org/jira/browse/FLINK-11630 is resolved, I think 
we should improve the {{MiniCluster}} by resetting the terminated components, 
since there might be reuse afterwards.
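
One possible shape of such a reset, as a rough sketch only: the class 
{{ResettingCluster}} and all of its methods are hypothetical and do not 
reflect the real {{MiniCluster}} code or the eventual fix. The idea is that 
terminating a component also removes its bookkeeping, so the caller owns the 
termination result and a later close only waits on components that are still 
running.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for MiniCluster; illustrates the idea only.
class ResettingCluster {
    // Termination futures of components that are still running.
    private final Map<Integer, CompletableFuture<Void>> running = new HashMap<>();
    private int nextId = 0;

    // Starts a component and returns its id.
    int startComponent() {
        int id = nextId++;
        running.put(id, new CompletableFuture<>());
        return id;
    }

    // Terminates one component and forgets it; a non-null error models a failed shutdown.
    // The caller is responsible for checking the returned future.
    CompletableFuture<Void> terminate(int id, Throwable error) {
        CompletableFuture<Void> future = running.remove(id);
        if (future == null) {
            throw new IllegalArgumentException("Unknown component " + id);
        }
        if (error == null) {
            future.complete(null);
        } else {
            future.completeExceptionally(error);
        }
        return future;
    }

    // Models teardown: only the components that are still running are awaited.
    void close() {
        running.values().forEach(future -> future.complete(null));
        CompletableFuture.allOf(running.values().toArray(new CompletableFuture[0])).join();
        running.clear();
    }

    int runningCount() {
        return running.size();
    }
}
```

With this shape, a failed termination is visible only through the future 
handed back to the caller, so a test that wants to detect it has to check 
that future explicitly.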

> TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on 
> Travis
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-11631
>                 URL: https://issues.apache.org/jira/browse/FLINK-11631
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.8.0
>            Reporter: Till Rohrmann
>            Assignee: Biao Liu
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.9.0
>
>
> The {{TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination}} is 
> unstable on Travis. It fails with 
> {code}
> 16:12:04.644 [ERROR] 
> testJobReExecutionAfterTaskExecutorTermination(org.apache.flink.runtime.taskexecutor.TaskExecutorITCase)
>   Time elapsed: 1.257 s  <<< ERROR!
> org.apache.flink.util.FlinkException: Could not close resource.
>       at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorITCase.teardown(TaskExecutorITCase.java:83)
> Caused by: org.apache.flink.util.FlinkException: Error while shutting the 
> TaskExecutor down.
> Caused by: org.apache.flink.util.FlinkException: Could not properly shut down 
> the TaskManager services.
> Caused by: java.lang.IllegalStateException: NetworkBufferPool is not empty 
> after destroying all LocalBufferPools
> {code} 
> https://api.travis-ci.org/v3/job/493221318/log.txt
> The problem seems to be caused by the {{TaskExecutor}} not properly waiting 
> for the termination of all running {{Tasks}}. Due to this, there is a race 
> condition which causes that not all buffers are returned to the 
> {{BufferPool}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
