[ https://issues.apache.org/jira/browse/FLINK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886741#comment-16886741 ]
Biao Liu commented on FLINK-11631:
----------------------------------
Basically [~till.rohrmann] is right: the root cause is that {{TaskExecutor}}
does not properly shut down running tasks. This test case shuts down a
{{TaskExecutor}} directly, without any cancellation. Currently {{TaskExecutor}}
does not cancel running tasks when it is shutting down, so an exception is
thrown because not all buffers have been returned to the {{BufferPool}}.
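To illustrate the failure mode, here is a minimal sketch. It is a toy pool, not Flink's {{NetworkBufferPool}}; the names {{ToyPool}} and {{shutDown}} are hypothetical, and "cancellation" is modeled as simply handing the buffer back before the pool is destroyed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a fixed-size buffer pool; all names are hypothetical, not Flink classes. */
public class BufferLeakSketch {

    static final class ToyPool {
        private final int capacity;
        private final Deque<byte[]> free = new ArrayDeque<>();

        ToyPool(int capacity) {
            this.capacity = capacity;
            for (int i = 0; i < capacity; i++) {
                free.push(new byte[32]);
            }
        }

        byte[] request() {
            return free.pop();
        }

        void recycle(byte[] buffer) {
            free.push(buffer);
        }

        /** Fails if a buffer is still held by a task when the pool is destroyed. */
        void destroy() {
            int outstanding = capacity - free.size();
            if (outstanding != 0) {
                throw new IllegalStateException(
                        "pool is not empty: " + outstanding + " buffer(s) outstanding");
            }
        }
    }

    /** Shuts the pool down, optionally cancelling the task (returning its buffer) first. */
    static boolean shutDown(boolean cancelTaskFirst) {
        ToyPool pool = new ToyPool(2);
        byte[] heldByTask = pool.request(); // a running task holds one buffer
        if (cancelTaskFirst) {
            pool.recycle(heldByTask); // assumed effect of cancelling the task
        }
        try {
            pool.destroy();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("shutdown without cancellation clean? " + shutDown(false));
        System.out.println("shutdown with cancellation clean? " + shutDown(true));
    }
}
```

In this model the shutdown only succeeds when the task gives its buffer back first, which is the step the comment above says is currently missing.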
The confusing part of this case is that the exception is thrown from
{{teardown}}. The exception actually occurs in
{{miniCluster.terminateTaskExecutor(0);}}.
However, the test case does not check the {{CompletableFuture}} of that
termination, and {{MiniCluster}} does not remove the terminated {{TaskExecutor}}.
In {{teardown}}, {{MiniCluster}} shuts down all {{TaskExecutors}} and
checks their termination {{CompletableFutures}}. This check fails because
the failed future of the already terminated {{TaskExecutor}} is still kept
by {{MiniCluster}}.
Until https://issues.apache.org/jira/browse/FLINK-11630 is resolved, I think
we should improve the {{MiniCluster}} by resetting terminated components,
since the cluster might be reused afterwards.
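A rough sketch of why the failure only surfaces in {{teardown}}, and of what resetting terminated components could look like. This is a simplified model built on plain {{CompletableFuture}}, not the actual {{MiniCluster}} code; the class, the {{forgetTerminated}} flag and all method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Toy model of a cluster that tracks termination futures; not Flink's MiniCluster. */
public class TerminationFutureSketch {

    private final List<CompletableFuture<Void>> terminationFutures = new ArrayList<>();
    private final boolean forgetTerminated; // hypothetical switch for the proposed reset

    TerminationFutureSketch(boolean forgetTerminated) {
        this.forgetTerminated = forgetTerminated;
    }

    /** Registers a component and returns its termination future. */
    CompletableFuture<Void> start() {
        CompletableFuture<Void> future = new CompletableFuture<>();
        terminationFutures.add(future);
        return future;
    }

    /** Terminates one component with a failure; the caller may ignore the result. */
    CompletableFuture<Void> terminateWithFailure(CompletableFuture<Void> future, Throwable cause) {
        future.completeExceptionally(cause);
        if (forgetTerminated) {
            terminationFutures.remove(future); // drop the already terminated component
        }
        return future;
    }

    /** Terminates the remaining components and reports whether all tracked futures succeeded. */
    boolean closeCleanly() {
        for (CompletableFuture<Void> future : terminationFutures) {
            future.complete(null); // no effect on futures that are already completed
        }
        try {
            CompletableFuture.allOf(terminationFutures.toArray(new CompletableFuture[0])).join();
            return true;
        } catch (CompletionException e) {
            return false; // a stale failed future makes the whole close fail
        }
    }

    /** Mirrors the test flow: terminate one component, ignore its future, then close. */
    static boolean run(boolean forgetTerminated) {
        TerminationFutureSketch cluster = new TerminationFutureSketch(forgetTerminated);
        CompletableFuture<Void> first = cluster.start();
        cluster.start();
        cluster.terminateWithFailure(first, new IllegalStateException("buffers not returned"));
        return cluster.closeCleanly();
    }

    public static void main(String[] args) {
        System.out.println("close succeeds while keeping the stale future? " + run(false));
        System.out.println("close succeeds after dropping it? " + run(true));
    }
}
```

In this model the early failure goes unnoticed because nobody inspects the returned future, and it only shows up when the close step combines all tracked futures. Dropping the terminated entry hides the stale failure from the later close, so the test would still need to check the termination future itself if that failure matters.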
> TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on
> Travis
> ------------------------------------------------------------------------------------
>
> Key: FLINK-11631
> URL: https://issues.apache.org/jira/browse/FLINK-11631
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.8.0
> Reporter: Till Rohrmann
> Assignee: Biao Liu
> Priority: Critical
> Labels: test-stability
> Fix For: 1.9.0
>
>
> The {{TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination}} is
> unstable on Travis. It fails with
> {code}
> 16:12:04.644 [ERROR]
> testJobReExecutionAfterTaskExecutorTermination(org.apache.flink.runtime.taskexecutor.TaskExecutorITCase)
> Time elapsed: 1.257 s <<< ERROR!
> org.apache.flink.util.FlinkException: Could not close resource.
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutorITCase.teardown(TaskExecutorITCase.java:83)
> Caused by: org.apache.flink.util.FlinkException: Error while shutting the
> TaskExecutor down.
> Caused by: org.apache.flink.util.FlinkException: Could not properly shut down
> the TaskManager services.
> Caused by: java.lang.IllegalStateException: NetworkBufferPool is not empty
> after destroying all LocalBufferPools
> {code}
> https://api.travis-ci.org/v3/job/493221318/log.txt
> The problem seems to be caused by the {{TaskExecutor}} not properly waiting
> for the termination of all running {{Tasks}}. This creates a race
> condition in which not all buffers are returned to the
> {{BufferPool}}.