cxzl25 opened a new pull request #25078: [SPARK-28305][YARN] Request GetExecutorLossReason to use a smaller timeout parameter
URL: https://github.com/apache/spark/pull/25078

## What changes were proposed in this pull request?

Make the ```GetExecutorLossReason``` request use a smaller timeout parameter.

In some cases, such as when an NM machine crashes or shuts down, the driver asks ```GetExecutorLossReason```, but the AM's ```getCompletedContainersStatuses``` cannot get the failure information for the container. Because the YARN NM detection timeout is 10 minutes (controlled by the parameter yarn.resourcemanager.rm.container-allocation.expiry-interval-ms), the AM has to wait 10 minutes to learn the cause of the container failure.

Although the driver's ask fails, it will call recover. However, because of the 2-minute timeout (spark.network.timeout) configured by ```IdleStateHandler```, the connection between the driver and the AM is closed first. The AM then exits, the app finishes, and the driver exits, causing the job to fail.

## How was this patch tested?
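
As an illustration of the timing problem described under "What changes were proposed" (not part of the patch or its tests), here is a minimal Python sketch of which event happens first after the driver sends the ask. The function `first_event`, the 30-second ask timeout, and the tie-breaking rule are hypothetical assumptions made for this sketch. Only the 10-minute and 2-minute values come from the description above.

```python
# Simplified model, not Spark code: after the driver sends the ask, three
# things can happen, and whichever comes first decides the outcome.
NM_DETECTION_S = 600   # 10 minutes before the AM learns why the container failed
IDLE_TIMEOUT_S = 120   # 2 minutes (spark.network.timeout) before the idle connection is closed


def first_event(reply_delay_s, ask_timeout_s, idle_timeout_s):
    """Return the name of the earliest event.

    Assumption: on a tie, the connection close wins over the ask timeout,
    which is the pessimistic case for the job.
    """
    events = [
        (reply_delay_s, 0, "reply"),
        (idle_timeout_s, 1, "connection_closed"),
        (ask_timeout_s, 2, "ask_timed_out"),
    ]
    # Tuples compare by time first, then by the priority number.
    return min(events)[2]


# Ask timeout equal to the idle timeout: the connection is closed first.
print(first_event(NM_DETECTION_S, 120, IDLE_TIMEOUT_S))  # connection_closed

# A smaller (hypothetical) ask timeout: the ask fails while the connection is still open.
print(first_event(NM_DETECTION_S, 30, IDLE_TIMEOUT_S))   # ask_timed_out

# The AM already knows the reason: the reply arrives before either timeout.
print(first_event(1, 30, IDLE_TIMEOUT_S))                # reply
```

In this model the job survives only when the ask either gets a reply or times out before the idle timeout closes the connection, which is the motivation for giving the request a smaller timeout.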
