cxzl25 opened a new pull request #25078: [SPARK-28305][YARN] Request GetExecutorLossReason to use a smaller timeout parameter
URL: https://github.com/apache/spark/pull/25078
 
 
   ## What changes were proposed in this pull request?
   Request GetExecutorLossReason to use a smaller timeout parameter.
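
The general idea of bounding an ask with a short timeout and falling back when it expires can be sketched as follows. This is a minimal Python illustration, not Spark's Scala RPC code: `ask_loss_reason`, its parameters, and the fallback string are hypothetical names chosen for this example.

```python
import concurrent.futures


def ask_loss_reason(reply_future, timeout_s, fallback="unknown (ask timed out)"):
    """Wait at most timeout_s seconds for a reply, then fall back.

    reply_future stands in for the pending GetExecutorLossReason reply
    (hypothetical; Spark's real RPC layer is written in Scala).
    """
    try:
        # Block until the reply arrives or the timeout expires.
        return reply_future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The reply did not arrive in time: recover with a generic reason
        # instead of waiting for the full detection interval.
        return fallback


if __name__ == "__main__":
    pending = concurrent.futures.Future()  # never completed, like a lost NM
    print(ask_loss_reason(pending, timeout_s=0.1))
```

With a short timeout, the caller gets a fallback answer quickly instead of blocking until the reply eventually arrives.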
   
   In some cases, such as when an NM machine crashes or shuts down, the driver asks ```GetExecutorLossReason```, but the AM's ```getCompletedContainersStatuses``` can't get the failure information of the container.
   
   Because the YARN NM detection timeout is 10 minutes (controlled by the parameter yarn.resourcemanager.rm.container-allocation.expiry-interval-ms), the AM has to wait 10 minutes to get the cause of the container failure.
   
   When the driver's ask fails, it calls recover.
   However, due to the 2-minute timeout (spark.network.timeout) configured by ```IdleStateHandler```, the connection between the driver and the AM is closed before that happens: the AM exits, the app finishes, and the driver exits, causing the job to fail.
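
The race described above can be made concrete with a simplified timeline model. This is only a sketch: `first_event` is a hypothetical helper, the 600 s and 120 s values come from the description above, and the ask timeouts of 300 s and 30 s are illustrative values that are not taken from this pull request.

```python
def first_event(reply_delay_s, ask_timeout_s, idle_timeout_s):
    """Return which of the three events happens first (simplified model).

    reply_delay_s  : time until the AM can answer with the loss reason
    ask_timeout_s  : timeout applied to the GetExecutorLossReason ask
    idle_timeout_s : idle timeout after which the connection is closed
    """
    events = {
        "reply": reply_delay_s,
        "ask_timeout": ask_timeout_s,
        "idle_close": idle_timeout_s,
    }
    # The event with the smallest time wins the race.
    return min(events, key=events.get)


if __name__ == "__main__":
    # Ask timeout longer than the idle timeout: the connection closes first.
    print(first_event(reply_delay_s=600, ask_timeout_s=300, idle_timeout_s=120))
    # Ask timeout shorter than the idle timeout: the ask fails first,
    # so the recover path can run while the connection is still open.
    print(first_event(reply_delay_s=600, ask_timeout_s=30, idle_timeout_s=120))
```

In this model the job only survives when the ask timeout is smaller than the idle timeout, which is the motivation for using a smaller timeout parameter.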
   
   ## How was this patch tested?
   
