cxzl25 commented on issue #23989: [SPARK-27073][CORE] Handling of IdleStateEvent causes the normal connection to close
URL: https://github.com/apache/spark/pull/23989#issuecomment-471164869
 
 
   @liupc 
   
   ```ApplicationMaster``` and ```Driver``` maintain a long-lived connection. ```IdleStateHandler``` checks for idleness every 2 minutes (120000 ms, the default of ```spark.network.timeout```) and closes the current connection if it decides there is a timed-out request.
   
   This is the condition under which the connection is currently closed:
   ```java
   System.nanoTime() - responseHandler.getTimeOfLastRequestNs() > requestTimeoutNs
       && responseHandler.numOutstandingRequests() > 0
   ```
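   
   For context, this check lives in ```TransportChannelHandler.userEventTriggered```. A simplified sketch of that method (logging elided):
   ```java
   @Override
   public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
     if (evt instanceof IdleStateEvent) {
       IdleStateEvent e = (IdleStateEvent) evt;
       // Current order: read the last-request timestamp first...
       boolean isActuallyOverdue =
           System.nanoTime() - responseHandler.getTimeOfLastRequestNs() > requestTimeoutNs;
       if (e.state() == IdleState.ALL_IDLE && isActuallyOverdue) {
         // ...then the outstanding-request count. A request added between the
         // two reads makes a healthy connection look dead.
         if (responseHandler.numOutstandingRequests() > 0) {
           client.timeOut();
           ctx.close();
         } else if (closeIdleConnections) {
           client.timeOut();
           ctx.close();
         }
       }
     }
     ctx.fireUserEventTriggered(evt);
   }
   ```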
   When the rpc server thread receives the ```IdleStateEvent```, it finds that the idle time has expired, so ```isActuallyOverdue=true```.
   At the same time, the dispatcher event loop thread finds that an executor is missing; ```YarnSchedulerEndpoint``` sends a ```GetExecutorLossReason``` message to the ```ApplicationMaster```, and ```TransportResponseHandler.addRpcRequest``` increases the number of outstanding requests.
   The rpc server thread then sees that the number of outstanding requests is greater than 0 and closes the connection, so the ```ApplicationMaster``` observes that the connection to the driver has been closed and shuts down.
   At this point, ```MonitorThread``` finds that the Yarn application has exited, calls ```SparkContext.stop```, and the exit code is 0.
   
   This way of judging timeouts can cause normal, healthy connections to be closed by accident.
   
   ```addRpcRequest``` updates ```timeOfLastRequestNs``` first and only then adds the request to ```outstandingRpcs``` (see the sketch below).
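   
   Roughly, in ```TransportResponseHandler``` (a sketch; the point is only the write order):
   ```java
   public void addRpcRequest(long requestId, RpcResponseCallback callback) {
     updateTimeOfLastRequest();                 // 1) refresh timeOfLastRequestNs first
     outstandingRpcs.put(requestId, callback);  // 2) only then make the request visible
   }
   ```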
   So in ```TransportChannelHandler``` we can read ```numOutstandingRequests``` first and ```getTimeOfLastRequestNs``` second; a request that was just added can then never be judged as a timed-out request, and no additional synchronization overhead is required.
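   
   A minimal sketch of the reordered check (same method as above; only the read order changes):
   ```java
   // Read the count BEFORE the timestamp. Because addRpcRequest() writes the
   // timestamp first and registers the request second, any request visible in
   // hasInFlightRequests has already refreshed timeOfLastRequestNs, so it can
   // never be counted as overdue, with no extra locking needed.
   boolean hasInFlightRequests = responseHandler.numOutstandingRequests() > 0;
   boolean isActuallyOverdue =
       System.nanoTime() - responseHandler.getTimeOfLastRequestNs() > requestTimeoutNs;
   if (e.state() == IdleState.ALL_IDLE && isActuallyOverdue) {
     if (hasInFlightRequests) {
       client.timeOut();
       ctx.close();
     } else if (closeIdleConnections) {
       client.timeOut();
       ctx.close();
     }
   }
   ```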
   
   At first my idea was to add a ```synchronized``` block, but this change seems better.
   
   ```ApplicationMaster``` (192.168.1.3) logs:
   ```
   19/03/05 14:04:37 [dispatcher-event-loop-17] INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 192.168.1.2:61609
   19/03/05 14:04:37 [dispatcher-event-loop-19] INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 192.168.1.2:61609
   Final app status: SUCCESS, exitCode: 0
   ```
   
   ```Driver``` (192.168.1.2) logs:
   ```
   19/03/05 14:04:37 [rpc-server-3-3] ERROR TransportChannelHandler: Connection to /192.168.1.3:43311 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
   19/03/05 14:04:37 [rpc-server-3-3] ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /192.168.1.3:43311 is closed
   [Yarn application state monitor] ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!
   ```
   Then ```SparkContext.stop()``` is called.
