cxzl25 commented on issue #25078: [SPARK-28305][YARN] Request 
GetExecutorLossReason to use a smaller timeout parameter
URL: https://github.com/apache/spark/pull/25078#issuecomment-510924574
 
 
   Yes, I tested successfully with the following configuration in the test 
environment, but I have not yet tried it at large scale in production.
   ```
   spark.rpc.askTimeout=120s
   spark.rpc.io.connectionTimeout=130s
   ```
   
   After the Driver closes the AM connection, the AM does not get a chance to 
reconnect, and the SparkContext also stops.
   ```scala
   finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
   ```
   
   Or should I take the minimum of the three configuration items?
   
   Later I discovered that this may also be a race condition.
   ```RpcOutboxMessage#onTimeout``` calls ```removeRpcRequest```:
   
https://github.com/apache/spark/blob/9df7587eead82889ca9a4efaaeb0afa55a0157cd/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java#L84-L91
   
   ```IdleStateHandler``` triggers an ```IdleStateEvent```.
   ```numOutstandingRequests>0``` may still count an RPC request that is about to be 
removed by another thread.
   ```isActuallyOverdue=true```
   
https://github.com/apache/spark/blob/9df7587eead82889ca9a4efaaeb0afa55a0157cd/common/network-common/src/main/java/org/apache/spark/network/server/TransportChannelHandler.java#L157-L160
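
To make the interleaving concrete, here is a minimal sketch of the two possible orderings. It is not Spark's code: the class `IdleTimeoutRaceSketch` and its methods are hypothetical stand-ins for the roles of `removeRpcRequest` and the idle-event check, and the two "threads" are run one after the other so the result is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of one connection; none of these names are Spark APIs.
class IdleTimeoutRaceSketch {
    private final Map<Long, String> outstandingRpcs = new ConcurrentHashMap<>();
    private boolean closed = false;

    // Register an RPC that is waiting for a reply.
    void sendRpc(long requestId) {
        outstandingRpcs.put(requestId, "pending");
    }

    // Plays the role of RpcOutboxMessage#onTimeout calling removeRpcRequest.
    void onAskTimeout(long requestId) {
        outstandingRpcs.remove(requestId);
    }

    // Plays the role of the idle-event check: close the connection if it is
    // overdue and there are still outstanding requests.
    void onIdleEvent(boolean isActuallyOverdue) {
        if (isActuallyOverdue && !outstandingRpcs.isEmpty()) {
            closed = true;
        }
    }

    boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        // Ordering A: the idle check runs just before the ask timeout removes the request.
        IdleTimeoutRaceSketch a = new IdleTimeoutRaceSketch();
        a.sendRpc(1L);
        a.onIdleEvent(true);
        a.onAskTimeout(1L);
        System.out.println("idle check first  -> closed = " + a.isClosed()); // true

        // Ordering B: the ask timeout removes the request before the idle check runs.
        IdleTimeoutRaceSketch b = new IdleTimeoutRaceSketch();
        b.sendRpc(1L);
        b.onAskTimeout(1L);
        b.onIdleEvent(true);
        System.out.println("ask timeout first -> closed = " + b.isClosed()); // false
    }
}
```

In this model the connection is closed only in ordering A, which is the case I believe we are hitting when both timeouts fire at nearly the same moment.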
   
   I'm not sure if my example is too extreme, but it does happen in our 
environment.
   
