Xintong Song created FLINK-23573:
------------------------------------
Summary: Tests fail with AdaptiveScheduler due to exceptions in
logs trying to offer slots after JobMaster shutdown
Key: FLINK-23573
URL: https://issues.apache.org/jira/browse/FLINK-23573
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Reporter: Xintong Song
Fix For: 1.14.0
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20267&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=412
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20396&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=2390
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20454&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=371
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21228&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=7c4a8fb8-eeee-5a77-f518-4176bfae300b&l=2437
The test failed due to exceptions in logs. I executed the following command
from flink-end-to-end-tests/test-scripts/common.sh on the logs, and it points
to the RecipientUnreachableException in TM logs. The problem is that, TM
received extra slot requests from RM after the tasks are finished and slots are
freed, while the JobMaster it tried to offer slots to had already shutdown.
{code}
$ grep -rv "GroupCoordinatorNotAvailableException" . \
| grep -v "RetriableCommitFailedException" \
| grep -v "NoAvailableBrokersException" \
| grep -v "Async Kafka commit failed" \
| grep -v "DisconnectException" \
| grep -v "Cannot connect to ResourceManager right now" \
| grep -v "AskTimeoutException" \
| grep -v "WARN akka.remote.transport.netty.NettyTransport" \
| grep -v "WARN
org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
| grep -v 'INFO.*AWSErrorCode' \
| grep -v "RejectedExecutionException" \
| grep -v "CancellationException" \
| grep -v "An exception was thrown by an exception handler" \
| grep -v "Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.yarn.exceptions.YarnException" \
| grep -v "Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.conf.Configuration" \
| grep -v "java.lang.NoClassDefFoundError:
org/apache/hadoop/yarn/exceptions/YarnException" \
| grep -v "java.lang.NoClassDefFoundError:
org/apache/hadoop/conf/Configuration" \
| grep -v "java.lang.Exception: Execution was suspended" \
| grep -v "java.io.InvalidClassException:
org.apache.flink.formats.avro.typeutils.AvroSerializer" \
| grep -v "Caused by: java.lang.Exception: JobManager is shutting down" \
| grep -v "java.lang.Exception: Artificial failure" \
| grep -v "org.apache.flink.runtime.checkpoint.CheckpointException" \
| grep -v "org.elasticsearch.ElasticsearchException" \
| grep -v "Elasticsearch exception" \
| grep -v "org.apache.flink.runtime.JobException: Recovery is suppressed" \
| grep -v "WARN akka.remote.ReliableDeliverySupervisor" \
| grep -i "exception"
./flink-vsts-taskexecutor-0-fv-az217-107.log:org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException:
Could not send message [RemoteFencedMessage(00000000000000000000000000000000,
RemoteRpcInvocation(null.offerSlots(ResourceID, Collection, Time)))] from
sender [Actor[akka.tcp://[email protected]:38955/temp/$0b]] to recipient
[Actor[akka://flink/user/rpc/jobmanager_2#1483449133]], because the recipient
is unreachable. This can either mean that the recipient has been terminated or
that the remote RpcService is currently not reachable.
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)