otterc opened a new pull request #27240: [SPARK-30512] Added a dedicated boss event loop group
URL: https://github.com/apache/spark/pull/27240
 
 
   ### What changes were proposed in this pull request?
   Add a dedicated boss event loop group to the Netty server in the External Shuffle Service to avoid delays in channel registration.
   ```
      // Single-threaded group used only to accept incoming connections.
      EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
        conf.getModuleName() + "-boss");
      // Group that handles I/O for the accepted channels.
      EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(),
        conf.getModuleName() + "-server");

      bootstrap = new ServerBootstrap()
        .group(bossGroup, workerGroup)
        .channel(NettyUtils.getServerChannelClass(ioMode))
        .option(ChannelOption.ALLOCATOR, allocator)
   ```
   
   ### Why are the changes needed?
   We have been seeing a large number of SASL authentication RPC requests to the external shuffle service time out.
   ```
   java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
        at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
        at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
        at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
        at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
   ```
   The investigation that we have done is described here:
   https://github.com/netty/netty/issues/9890
   
   After adding `LoggingHandler` to the Netty pipeline, we saw that channel registration was being delayed because the worker threads were busy with existing channels.
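
The effect described above can be illustrated with a simplified analogy built on plain `java.util.concurrent` executors rather than Netty itself. In this sketch, which is an assumption-laden model and not the patch's code, every worker thread is blocked to stand in for "busy with existing channels", and a trivial "accept" task is submitted either to the shared worker pool or to a separate single-threaded executor. The class name `BossGroupSketch`, the method `acceptCompletesWhileWorkersBusy`, and the 500 ms timeout are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only: thread pools stand in for Netty event loop groups.
class BossGroupSketch {

  /**
   * Returns true if a simulated "accept" task finishes while every worker
   * thread is still busy, and false if it times out waiting in the queue.
   */
  static boolean acceptCompletesWhileWorkersBusy(boolean dedicatedBoss, int workerThreads)
      throws Exception {
    ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
    // With dedicatedBoss == false the accept task shares the worker pool.
    ExecutorService boss = dedicatedBoss ? Executors.newSingleThreadExecutor() : workers;
    CountDownLatch allBusy = new CountDownLatch(workerThreads);
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Occupy every worker thread, modelling workers tied up with existing channels.
      for (int i = 0; i < workerThreads; i++) {
        workers.submit(() -> {
          allBusy.countDown();
          try {
            release.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      allBusy.await();

      // Submit the simulated accept task and wait briefly for it to run.
      Future<?> accept = boss.submit(() -> { });
      try {
        accept.get(500, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false;
      }
    } finally {
      // Unblock the workers and shut the pools down.
      release.countDown();
      workers.shutdown();
      if (boss != workers) {
        boss.shutdown();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("shared group:   " + acceptCompletesWhileWorkersBusy(false, 4));
    System.out.println("dedicated boss: " + acceptCompletesWhileWorkersBusy(true, 4));
  }
}
```

In the shared case the accept task sits in the queue behind the blocked workers and the wait times out; with the separate executor it runs right away. Netty's event loops are more involved than a thread pool, so this only models the queuing behaviour, not the actual channel registration path.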
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   We have tested the patch on our clusters and with a stress testing tool. 
After this change, we didn't see any SASL requests timing out. Existing unit 
tests pass.
