otterc commented on issue #27240: [SPARK-30512] Added a dedicated boss event loop group URL: https://github.com/apache/spark/pull/27240#issuecomment-577905290 @tgravescs In the 30 seconds duration, the worker thread does get time to read from the channels that are registered to it. The code in `NioEventLoop.run()` method. Specifically, this part is relevant. ``` try { processSelectedKeys(); } finally { // Ensure we always run tasks. final long ioTime = System.nanoTime() - ioStartTime; runAllTasks(ioTime * (100 - ioRatio) / ioRatio); } ``` `processSelectedKeys` is where the bytes from the registered channels are read. The method `runAllTasks`, which processes pending tasks in the event loop queue, is executed for a limited amount of time. In sparks case the ioRatio is 50 so `runAllTasks` is executed for the same amount of time which was recorded to execute `processSelectedKeys`. The problem here is the worker thread is busy with running `processSelectedKeys()` and `runAllTasks()` in a loop and doesn't get time to accept incoming channel and register it. In the 30 seconds before a request is timed-out, I have observed multiple invocations of processSelectedKeys and runAllTasks. ``` 2019-12-13 20:34:12,640 INFO org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 selectedKeys nanos 4590296 millis 4 2019-12-13 20:34:13,081 INFO org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 runAllTasks millis 441 2019-12-13 20:34:13,087 INFO org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 selectedKeys nanos 5735343 millis 5 2019-12-13 20:34:13,149 INFO org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 runAllTasks millis 62 2019-12-13 20:34:13,150 INFO org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 selectedKeys nanos 1332499 millis 1 2019-12-13 20:34:13,151 INFO org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 runAllTasks millis ``` Once the worker thread reads a SASL message, it processes them in millis or less. The time taken from reading the first byte of SASL message to sending its response which I have observed is mostly within milliseconds. ``` 2019-12-13 20:34:13,895 INFO org.apache.spark.network.server.LastHandler OUTPUT request 8863541504479811769 channel 6fc450d5 request_rec 1576269253894 flush 1576269253895 ``` > I thought that timeout was for the send and response from server (in this case with the auth response) that would start the negotiation. So just accepting it wouldn't get a response. This is correct. For Sasl messages the time from reading the first byte of the message to sending its response is in millis. So once the channel gets registered, the event loop will read from it and send a response very quickly. > Does the boss thread block on handing a connection to the worker or does it have a queueing mechanism it can keep accepting while others wait on workers? The event loop assignment to a channel happens in a round robin fashion and by default doesn't depend on how busy the event loop is. Once an event loop is chosen, I don't know how the connection is handed over to the worker thread. Since a boss thread is also a netty event loop, I assume it will be non blocking.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
