otterc commented on issue #27240: [SPARK-30512] Added a dedicated boss event 
loop group
URL: https://github.com/apache/spark/pull/27240#issuecomment-577905290
 
 
   @tgravescs In the 30 seconds duration, the worker thread does get time to 
read from the channels that are registered to it.
   
   The code in `NioEventLoop.run()` method. Specifically, this part is relevant.
   ```
   try {
       processSelectedKeys();
   } finally {
       // Ensure we always run tasks.
       final long ioTime = System.nanoTime() - ioStartTime;
       runAllTasks(ioTime * (100 - ioRatio) / ioRatio);
   }
   ```
   `processSelectedKeys` is where the bytes from the registered channels are 
read. The method `runAllTasks`, which processes pending tasks in the event loop 
queue, is executed for a limited amount of time. In sparks case the ioRatio is 
50 so `runAllTasks` is executed for the same amount of time which was recorded 
to execute `processSelectedKeys`.
   
   The problem here is the worker thread is busy with running 
`processSelectedKeys()` and `runAllTasks()` in a loop and doesn't get time to 
accept incoming channel and register it.
   
   In the 30 seconds before a request is timed-out, I have observed multiple 
invocations of processSelectedKeys and runAllTasks.
   ```
   2019-12-13 20:34:12,640 INFO 
org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 
selectedKeys nanos 4590296 millis 4
   2019-12-13 20:34:13,081 INFO 
org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 
runAllTasks millis 441
   2019-12-13 20:34:13,087 INFO 
org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 
selectedKeys nanos 5735343 millis 5
   2019-12-13 20:34:13,149 INFO 
org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 
runAllTasks millis 62
   2019-12-13 20:34:13,150 INFO 
org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 
selectedKeys nanos 1332499 millis 1
   2019-12-13 20:34:13,151 INFO 
org.spark_project.io.netty.channel.nio.NioEventLoop: shuffle-server-4-1 
runAllTasks millis 
   ```
   Once the worker thread reads a SASL message, it processes them in millis or 
less.  The time taken from reading the first byte of SASL message to sending 
its response which I have observed is mostly within milliseconds.
   ```
   2019-12-13 20:34:13,895 INFO org.apache.spark.network.server.LastHandler 
OUTPUT request 8863541504479811769 channel 6fc450d5 request_rec 1576269253894 
flush 1576269253895
   ```
   
   > I thought that timeout was for the send and response from server (in this 
case with the auth response) that would start the negotiation. So just 
accepting it wouldn't get a response.
   
   This is correct. For Sasl messages the time from reading the first byte of 
the message to sending its response is in millis.
   So once the channel gets registered, the event loop will read from it and 
send a response very quickly.
   
   > Does the boss thread block on handing a connection to the worker or does 
it have a queueing mechanism it can keep accepting while others wait on workers?
   
   The event loop assignment to a channel happens in a round robin fashion and 
by default doesn't depend on how busy the event loop is. Once an event loop is 
chosen, I don't know how the connection is handed over to the worker thread. 
Since a boss thread is also a netty event loop, I assume it will be non 
blocking.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to