cxzl25 commented on issue #23989: [SPARK-27073][CORE] Handling of IdleStateEvent causes the normal connection to close URL: https://github.com/apache/spark/pull/23989#issuecomment-471164869

@liupc The `ApplicationMaster` and the `Driver` maintain a long-lived connection. `IdleStateHandler` checks it every 2 minutes and closes the connection when this condition holds:

```java
System.nanoTime() - responseHandler.getTimeOfLastRequestNs() > requestTimeoutNs
    && responseHandler.numOutstandingRequests() > 0
```

When the rpc server thread receives the `IdleStateEvent`, it finds that the timeout has elapsed, so `isActuallyOverdue = true`. At the same time, the dispatcher event loop thread notices that an executor is missing, so `YarnSchedulerEndpoint` sends a `GetExecutorLossReason` message to the `ApplicationMaster`, and `TransportResponseHandler.addRpcRequest` increments the number of outstanding requests. The rpc server thread then sees that the number of outstanding requests is greater than 0 and closes the connection; the `ApplicationMaster` observes that its connection to the driver has closed and shuts down. At this point `MonitorThread` finds that the Yarn application has exited and calls `SparkContext.stop`, and the exit code is 0.

So the current way of judging timeouts can accidentally close normal, healthy connections.

Because `addRpcRequest` updates `timeOfLastRequestNs` first and only then adds the request to `outstandingRpcs`, `TransportChannelHandler` can read `numOutstandingRequests` first and `getTimeOfLastRequestNs` second. With that ordering, a request that was just added can never be judged a timed-out request, and no additional synchronization overhead is required. At first my idea was to add a synchronized block, but this change seems better.
`ApplicationMaster` 192.168.1.3 logs:

```
19/03/05 14:04:37 [dispatcher-event-loop-17] INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 192.168.1.2:61609
19/03/05 14:04:37 [dispatcher-event-loop-19] INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 192.168.1.2:61609
Final app status: SUCCESS, exitCode: 0
```

`Driver` 192.168.1.2 logs:

```
19/03/05 14:04:37 [rpc-server-3-3] ERROR TransportChannelHandler: Connection to /192.168.1.3:43311 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
19/03/05 14:04:37 [rpc-server-3-3] ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /192.168.1.3:43311 is closed
[Yarn application state monitor] ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! SparkContext.stop()
```
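The ordering argument above can be sketched as follows. This is a minimal, hypothetical simplification (the class and method names mirror the real `TransportResponseHandler` ones, but the bodies are stand-ins, not the actual Spark sources): because the timestamp is refreshed before the request is registered, a check that reads the outstanding count first and the timestamp second can never see a just-added request as overdue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified stand-in for TransportResponseHandler.
class ResponseHandler {
    private final ConcurrentHashMap<Long, String> outstandingRpcs = new ConcurrentHashMap<>();
    private final AtomicLong timeOfLastRequestNs = new AtomicLong(0);

    // Mirrors the ordering described above: the timestamp is updated
    // BEFORE the rpc is added to outstandingRpcs.
    void addRpcRequest(long requestId, String callback) {
        timeOfLastRequestNs.set(System.nanoTime());
        outstandingRpcs.put(requestId, callback);
    }

    int numOutstandingRequests() { return outstandingRpcs.size(); }
    long getTimeOfLastRequestNs() { return timeOfLastRequestNs.get(); }
}

public class IdleCheck {
    // The proposed check order: read the count first, then the timestamp.
    // Any request that made the count non-zero has already refreshed the
    // timestamp, so a just-added request can never look overdue.
    static boolean isActuallyOverdue(ResponseHandler h, long requestTimeoutNs) {
        boolean hasOutstanding = h.numOutstandingRequests() > 0;
        boolean overdue = System.nanoTime() - h.getTimeOfLastRequestNs() > requestTimeoutNs;
        return hasOutstanding && overdue;
    }

    public static void main(String[] args) {
        ResponseHandler h = new ResponseHandler();
        h.addRpcRequest(1L, "dummy-callback");
        // A request added just now must not be judged overdue (120 s timeout).
        System.out.println(isActuallyOverdue(h, 120_000_000_000L)); // prints "false"
    }
}
```

Reversing the reads in the buggy order (timestamp first, count second) is exactly what lets the race through: the idle check can observe a stale timestamp, then observe the newly incremented count, and wrongly close the connection.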
