turboFei commented on a change in pull request #24533: [SPARK-27637] For
NettyBlockTransferService, when an exception occurs while fetching data, check
whether the relevant executor is alive before retrying
URL: https://github.com/apache/spark/pull/24533#discussion_r281432458
##########
File path:
core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala
##########
@@ -117,11 +125,26 @@ private[spark] class NettyBlockTransferService(
}
}
+    val executorAliveChecker = new RetryingBlockFetcher.ExecutorAliveChecker {
+      override def check(): Boolean = {
Review comment:
> BTW I vaguely remember that we've already tracked the alive executors at
the driver side, can you double check?

@cloud-fan Hi, I have checked the relevant code in
NettyBlockTransferService and RetryingBlockFetcher.
When an executor dies during a shuffle fetch, RetryingBlockFetcher only
checks whether the exception is an IOException and whether any retries
remain; it does not consult the alive executors tracked at the driver side.
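To make the gap concrete, here is a minimal, self-contained sketch (not the actual Spark code; names like `shouldRetry` and the `executorIsAlive` callback are illustrative) of the current retry decision, which looks only at the exception type and the retry budget, alongside a variant that also consults a pluggable executor-alive check so a fetch from a removed executor can fail fast:

```scala
import java.io.IOException

// Hypothetical sketch of the retry decision in RetryingBlockFetcher-style
// code. Today, any IOException with remaining retries triggers a retry,
// even if the remote executor is already gone.
object RetryDecisionSketch {
  val maxRetries = 3

  // Current behavior: only the exception type and the retry budget matter.
  def shouldRetry(t: Throwable, retryCount: Int): Boolean = {
    val isIOException = t.isInstanceOf[IOException] ||
      (t.getCause != null && t.getCause.isInstanceOf[IOException])
    isIOException && retryCount < maxRetries
  }

  // Proposed behavior (hypothetical signature): additionally consult an
  // executor-alive check, so a fetch from a dead executor stops retrying.
  def shouldRetryWithAliveCheck(
      t: Throwable,
      retryCount: Int,
      executorIsAlive: () => Boolean): Boolean = {
    shouldRetry(t, retryCount) && executorIsAlive()
  }

  def main(args: Array[String]): Unit = {
    val ioe = new IOException("Connection reset by peer")
    // Today: an IOException from a dead executor is still retried.
    assert(shouldRetry(ioe, 0))
    // With the alive check: no retry once the executor is known dead.
    assert(!shouldRetryWithAliveCheck(ioe, 0, () => false))
    println("ok")
  }
}
```

The alive check would be the piece wired in through the `ExecutorAliveChecker` interface the patch introduces.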
### Case
I ran into this problem in our production environment recently with
spark-2.3.2, with the external shuffle service and dynamic allocation enabled.
While fetching broadcast data through NettyBlockTransferService, the fetcher
successfully created a connection to executor_a. Then executor_a was removed
because it had been idle longer than the configured timeout.
As a result, NettyBlockTransferService caught an IOException and kept
retrying until no retries remained.
This is the log of executor_a
(spark.dynamicAllocation.executorIdleTimeout=40s); we can see that it was
removed at 2019-04-26 12:18:49.
```
2019-04-26 12:18:09,393 [7363] - INFO [Executor task launch worker for
task 1357:Logging$class@54] - Finished task 831.0 in stage 3.0 (TID 1357). 2046
bytes result sent to driver
2019-04-26 12:18:09,397 [7367] - INFO [Executor task launch worker for
task 1358:Logging$class@54] - Finished task 968.0 in stage 3.0 (TID 1358). 2951
bytes result sent to driver
2019-04-26 12:18:49,838 [47808] - ERROR [SIGTERM
handler:SignalUtils$$anonfun$registerLogger$1$$anonfun$apply$1@43] - RECEIVED
SIGNAL TERM
2019-04-26 12:18:49,843 [47813] - INFO [Thread-3:Logging$class@54] -
Shutdown hook called
2019-04-26 12:18:49,844 [47814] - INFO [Thread-3:Logging$class@54] -
Shutdown hook called
```
This is the log of NettyBlockTransferService. We can see that after
executor_a was removed at 2019-04-26 12:18:49, the fetcher caught an
IOException at 2019-04-26 12:18:50 and scheduled a retry.
```
2019-04-26 12:18:49,848 [25708] - INFO [Executor task launch worker for
task 1689:Logging$class@54] - Started reading broadcast variable 5
2019-04-26 12:18:49,906 [25766] - INFO [Executor task launch worker for
task 1689:TransportClientFactory@254] - Successfully created connection to
hadoop3977.jd.163.org/10.196.64.218:38939 after 1 ms (0 ms spent in bootstraps)
2019-04-26 12:18:50,291 [26151] - WARN
[shuffle-client-4-1:TransportChannelHandler@78] - Exception in connection from
hadoop3977.jd.163.org/10.196.64.218:38939
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
......
......
2019-04-26 12:18:50,296 [26156] - INFO
[shuffle-client-4-1:RetryingBlockFetcher@164] - Retrying fetch (1/30) for 1
outstanding blocks after 20000 ms
```
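The retry line above implies a long worst-case stall: taking the values visible in the log (`(1/30)` retries, `20000 ms` between attempts, which in this setup come from the shuffle retry configuration), a fetch from a permanently dead executor can block for roughly ten minutes before failing. A back-of-envelope sketch of that arithmetic:

```scala
// Worst-case time spent in retry waits when every retry is doomed to fail,
// using the values from the log above: 30 retries, 20000 ms between attempts.
object RetryCostSketch {
  def worstCaseWaitMs(maxRetries: Int, retryWaitMs: Long): Long =
    maxRetries * retryWaitMs

  def main(args: Array[String]): Unit = {
    val ms = worstCaseWaitMs(30, 20000L)
    // 600000 ms = 600 s, i.e. about 10 minutes of pointless waiting
    println(s"worst-case wait: ${ms / 1000} s")
  }
}
```

An up-front executor-alive check avoids paying this cost when the target executor is already known to be removed.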
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]