otterc commented on issue #27240: [SPARK-30512] Added a dedicated boss event loop group
URL: https://github.com/apache/spark/pull/27240#issuecomment-575372959

@vanzin Even with SPARK-24355, we kept seeing SASL requests time out. This was reported in https://issues.apache.org/jira/browse/SPARK-29206. Having a dedicated event executor group to process ChunkFetchRequests helped initially, but as the load on our External Shuffle Services increased, we saw SASL requests timing out even with a timeout of 120s. After investigating the issue from different angles (documented in this netty issue: https://github.com/netty/netty/issues/9890) and following the debugging pointers from the netty folks, we figured out that even channel registration (binding of a channel to an event loop) was delayed by more than 30s. At peak load, the worker threads (server I/O threads) are busy reading from and writing back to the existing channels, which is why new channel registration is delayed. Having a dedicated boss event loop group fixed this.
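
To illustrate the effect described above, here is a minimal sketch that uses plain `java.util.concurrent` executors as a stand-in for event loops. It is not Netty or Spark code: the class `BossLoopSketch`, the method `registeredWhileWorkerBusy`, and the one-second wait are hypothetical names and values chosen for illustration, and treating registration as a single task is a simplification. In Netty itself, the boss and worker groups are passed separately via `ServerBootstrap.group(bossGroup, workerGroup)`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BossLoopSketch {

    /**
     * Toy model: one single-threaded "worker loop" is kept busy by a long I/O task.
     * Returns true if a new "registration" task finishes while the worker is still busy.
     */
    static boolean registeredWhileWorkerBusy(boolean dedicatedBoss) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Assumption: with no dedicated boss, registration shares the worker's queue.
        ExecutorService boss = dedicatedBoss ? Executors.newSingleThreadExecutor() : worker;
        CountDownLatch ioStarted = new CountDownLatch(1);
        CountDownLatch releaseIo = new CountDownLatch(1);
        try {
            // Simulated read/write work that occupies the worker thread until released.
            worker.submit(() -> {
                ioStarted.countDown();
                releaseIo.await();
                return null;
            });
            ioStarted.await();

            // Simulated channel registration arriving while the worker is busy.
            Future<?> registration = boss.submit(() -> { });
            try {
                registration.get(1, TimeUnit.SECONDS);
                return true;   // handled promptly by the separate boss thread
            } catch (TimeoutException e) {
                return false;  // still queued behind the busy worker
            }
        } finally {
            releaseIo.countDown();
            worker.shutdown();
            boss.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared loop:    " + registeredWhileWorkerBusy(false));
        System.out.println("dedicated boss: " + registeredWhileWorkerBusy(true));
    }
}
```

In this model the shared case cannot register until the worker is released, while the dedicated case does not wait on the worker at all. The real delay on an External Shuffle Service depends on load and thread counts, which the sketch does not attempt to reproduce.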
