[ 
https://issues.apache.org/jira/browse/FLINK-23573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song closed FLINK-23573.
--------------------------------
    Resolution: Fixed

Fixed via
- master (1.14): e45723b503b9cc793317a6cad7c0d4c8075c0d16

> Tests fail with AdaptiveScheduler due to exceptions in logs trying to offer 
> slots after JobMaster shutdown
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-23573
>                 URL: https://issues.apache.org/jira/browse/FLINK-23573
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 1.14.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20267&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=412
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20396&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=2390
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20454&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=371
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21228&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=7c4a8fb8-eeee-5a77-f518-4176bfae300b&l=2437
> The test failed due to exceptions in the logs. I ran the following command 
> from flink-end-to-end-tests/test-scripts/common.sh on the logs, and it points 
> to the RecipientUnreachableException in the TM logs. The problem is that the 
> TM received extra slot requests from the RM after the tasks had finished and 
> the slots had been freed, while the JobMaster it tried to offer the slots to 
> had already shut down.
> {code}
> $ grep -rv "GroupCoordinatorNotAvailableException" . \
>    | grep -v "RetriableCommitFailedException" \
>    | grep -v "NoAvailableBrokersException" \
>    | grep -v "Async Kafka commit failed" \
>    | grep -v "DisconnectException" \
>    | grep -v "Cannot connect to ResourceManager right now" \
>    | grep -v "AskTimeoutException" \
>    | grep -v "WARN  akka.remote.transport.netty.NettyTransport" \
>    | grep -v  "WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
>    | grep -v 'INFO.*AWSErrorCode' \
>    | grep -v "RejectedExecutionException" \
>    | grep -v "CancellationException" \
>    | grep -v "An exception was thrown by an exception handler" \
>    | grep -v "Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.exceptions.YarnException" \
>    | grep -v "Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration" \
>    | grep -v "java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/exceptions/YarnException" \
>    | grep -v "java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration" \
>    | grep -v "java.lang.Exception: Execution was suspended" \
>    | grep -v "java.io.InvalidClassException: org.apache.flink.formats.avro.typeutils.AvroSerializer" \
>    | grep -v "Caused by: java.lang.Exception: JobManager is shutting down" \
>    | grep -v "java.lang.Exception: Artificial failure" \
>    | grep -v "org.apache.flink.runtime.checkpoint.CheckpointException" \
>    | grep -v "org.elasticsearch.ElasticsearchException" \
>    | grep -v "Elasticsearch exception" \
>    | grep -v "org.apache.flink.runtime.JobException: Recovery is suppressed" \
>    | grep -v "WARN  akka.remote.ReliableDeliverySupervisor" \
>    | grep -i "exception"
> ./flink-vsts-taskexecutor-0-fv-az217-107.log:org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException:
>  Could not send message 
> [RemoteFencedMessage(00000000000000000000000000000000, 
> RemoteRpcInvocation(null.offerSlots(ResourceID, Collection, Time)))] from 
> sender [Actor[akka.tcp://[email protected]:38955/temp/$0b]] to recipient 
> [Actor[akka://flink/user/rpc/jobmanager_2#1483449133]], because the recipient 
> is unreachable. This can either mean that the recipient has been terminated 
> or that the remote RpcService is currently not reachable.
> {code}
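
The pipeline quoted above drops every log line that matches a known, tolerated pattern and then reports whatever still mentions "exception". A minimal sketch of the same idea follows, with the patterns kept in a separate file. The function name find_unexpected_exceptions and the allow-list file are hypothetical and are not part of common.sh. Patterns are treated as fixed strings here, so regex entries such as 'INFO.*AWSErrorCode' would need separate handling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of common.sh): print log lines that mention
# "exception" (case-insensitive) and match none of the allow-listed strings.
find_unexpected_exceptions() {
  local log_dir="$1"     # directory containing the log files to scan
  local allow_file="$2"  # one fixed-string pattern per line, no blank lines
  # -r recurse, -v invert the match, -F fixed strings, -f read patterns from file
  grep -rvF -f "$allow_file" "$log_dir" | grep -i "exception"
}
```

Keep the allow-list file outside the scanned directory so its own lines are not reported, and avoid blank lines in it: an empty pattern matches every line, so the inverted match would suppress all output.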



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
