[jira] [Created] (FLINK-23573) Tests fail with AdaptiveScheduler due to exceptions in logs trying to offer slots after JobMaster shutdown

Xintong Song (Jira) Sun, 01 Aug 2021 21:12:07 -0700

Xintong Song created FLINK-23573:
------------------------------------

             Summary: Tests fail with AdaptiveScheduler due to exceptions in 
logs trying to offer slots after JobMaster shutdown
                 Key: FLINK-23573
                 URL: https://issues.apache.org/jira/browse/FLINK-23573
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
            Reporter: Xintong Song
             Fix For: 1.14.0



https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20267&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=412

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20396&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=2390

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20454&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=371

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21228&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=7c4a8fb8-eeee-5a77-f518-4176bfae300b&l=2437

The test failed due to exceptions in logs. I executed the following command 
from flink-end-to-end-tests/test-scripts/common.sh on the logs, and it points 
to the RecipientUnreachableException in TM logs. The problem is that, TM 
received extra slot requests from RM after the tasks are finished and slots are 
freed, while the JobMaster it tried to offer slots to had already shutdown.

{code}
$ grep -rv "GroupCoordinatorNotAvailableException" . \
   | grep -v "RetriableCommitFailedException" \
   | grep -v "NoAvailableBrokersException" \
   | grep -v "Async Kafka commit failed" \
   | grep -v "DisconnectException" \
   | grep -v "Cannot connect to ResourceManager right now" \
   | grep -v "AskTimeoutException" \
   | grep -v "WARN  akka.remote.transport.netty.NettyTransport" \
   | grep -v  "WARN  
org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
   | grep -v 'INFO.*AWSErrorCode' \
   | grep -v "RejectedExecutionException" \
   | grep -v "CancellationException" \
   | grep -v "An exception was thrown by an exception handler" \
   | grep -v "Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.yarn.exceptions.YarnException" \
   | grep -v "Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.conf.Configuration" \
   | grep -v "java.lang.NoClassDefFoundError: 
org/apache/hadoop/yarn/exceptions/YarnException" \
   | grep -v "java.lang.NoClassDefFoundError: 
org/apache/hadoop/conf/Configuration" \
   | grep -v "java.lang.Exception: Execution was suspended" \
   | grep -v "java.io.InvalidClassException: 
org.apache.flink.formats.avro.typeutils.AvroSerializer" \
   | grep -v "Caused by: java.lang.Exception: JobManager is shutting down" \
   | grep -v "java.lang.Exception: Artificial failure" \
   | grep -v "org.apache.flink.runtime.checkpoint.CheckpointException" \
   | grep -v "org.elasticsearch.ElasticsearchException" \
   | grep -v "Elasticsearch exception" \
   | grep -v "org.apache.flink.runtime.JobException: Recovery is suppressed" \
   | grep -v "WARN  akka.remote.ReliableDeliverySupervisor" \
   | grep -i "exception"
./flink-vsts-taskexecutor-0-fv-az217-107.log:org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException:
 Could not send message [RemoteFencedMessage(00000000000000000000000000000000, 
RemoteRpcInvocation(null.offerSlots(ResourceID, Collection, Time)))] from 
sender [Actor[akka.tcp://[email protected]:38955/temp/$0b]] to recipient 
[Actor[akka://flink/user/rpc/jobmanager_2#1483449133]], because the recipient 
is unreachable. This can either mean that the recipient has been terminated or 
that the remote RpcService is currently not reachable.
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (FLINK-23573) Tests fail with AdaptiveScheduler due to exceptions in logs trying to offer slots after JobMaster shutdown

Reply via email to