turboFei commented on a change in pull request #24533: [SPARK-27637] For
NettyBlockTransferService, when an exception occurs while fetching data, check
whether the relevant executor is alive before retrying
URL: https://github.com/apache/spark/pull/24533#discussion_r281432458
##########
File path:
core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala
##########
@@ -117,11 +125,26 @@ private[spark] class NettyBlockTransferService(
}
}
+    val executorAliveChecker = new RetryingBlockFetcher.ExecutorAliveChecker {
+      override def check(): Boolean = {
Review comment:
> BTW I vaguely remember that we've already tracked the alive executors at
the driver side, can you double check?

@cloud-fan Hi, I have checked the relevant code in
NettyBlockTransferService and RetryingBlockFetcher.
When an executor dies during a shuffle fetch, RetryingBlockFetcher only
checks whether the exception is an IOException and whether any retries
remain; it does not consult the alive executors tracked at the driver side.
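To make the gap concrete, here is a minimal, self-contained sketch (not the actual Spark code; names like `shouldRetry` and the `executorIsAlive` callback are illustrative) of the current retry decision, which looks only at the exception type and the retry budget, alongside a variant that also consults a pluggable executor-alive check so a fetch from a removed executor can fail fast:

```scala
import java.io.IOException

// Hypothetical sketch of the retry decision in RetryingBlockFetcher-style
// code. Today, any IOException with remaining retries triggers a retry,
// even if the remote executor is already gone.
object RetryDecisionSketch {
  val maxRetries = 3

  // Current behavior: only the exception type and the retry budget matter.
  def shouldRetry(t: Throwable, retryCount: Int): Boolean = {
    val isIOException = t.isInstanceOf[IOException] ||
      (t.getCause != null && t.getCause.isInstanceOf[IOException])
    isIOException && retryCount < maxRetries
  }

  // Proposed behavior (hypothetical signature): additionally consult an
  // executor-alive check, so a fetch from a dead executor stops retrying.
  def shouldRetryWithAliveCheck(
      t: Throwable,
      retryCount: Int,
      executorIsAlive: () => Boolean): Boolean = {
    shouldRetry(t, retryCount) && executorIsAlive()
  }

  def main(args: Array[String]): Unit = {
    val ioe = new IOException("Connection reset by peer")
    // Today: an IOException from a dead executor is still retried.
    assert(shouldRetry(ioe, 0))
    // With the alive check: no retry once the executor is known dead.
    assert(!shouldRetryWithAliveCheck(ioe, 0, () => false))
    println("ok")
  }
}
```

The alive check would be the piece wired in through the `ExecutorAliveChecker` interface the patch introduces.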
### Case
I ran into this problem in our production environment recently with
spark-2.3.2, with the external shuffle service and dynamic allocation enabled.
While fetching broadcast data through NettyBlockTransferService, the fetcher
successfully created a connection to executor_a. Then executor_a was removed
because it had been idle longer than the configured timeout.
As a result, NettyBlockTransferService caught an IOException and kept
retrying until no retries remained.
This is the log of executor_a
(spark.dynamicAllocation.executorIdleTimeout=40s); we can see that it was
removed at 2019-04-26 12:18:49.
```
2019-04-26 12:18:09,393 [7363] - INFO [Executor task launch worker for
task 1357:Logging$class@54] - Finished task 831.0 in stage 3.0 (TID 1357). 2046
bytes result sent to driver
2019-04-26 12:18:09,397 [7367] - INFO [Executor task launch worker for
task 1358:Logging$class@54] - Finished task 968.0 in stage 3.0 (TID 1358). 2951
bytes result sent to driver
2019-04-26 12:18:49,838 [47808] - ERROR [SIGTERM
handler:SignalUtils$$anonfun$registerLogger$1$$anonfun$apply$1@43] - RECEIVED
SIGNAL TERM
2019-04-26 12:18:49,843 [47813] - INFO [Thread-3:Logging$class@54] -
Shutdown hook called
2019-04-26 12:18:49,844 [47814] - INFO [Thread-3:Logging$class@54] -
Shutdown hook called
```
This is the log of NettyBlockTransferService. We can see that after
executor_a was removed at 2019-04-26 12:18:49, the fetcher caught an
IOException at 2019-04-26 12:18:50 and scheduled a retry.
```
2019-04-26 12:18:49,848 [25708] - INFO [Executor task launch worker for
task 1689:Logging$class@54] - Started reading broadcast variable 5
2019-04-26 12:18:49,906 [25766] - INFO [Executor task launch worker for
task 1689:TransportClientFactory@254] - Successfully created connection to
hadoop3977.jd.163.org/10.196.64.218:38939 after 1 ms (0 ms spent in bootstraps)
2019-04-26 12:18:50,291 [26151] - WARN
[shuffle-client-4-1:TransportChannelHandler@78] - Exception in connection from
hadoop3977.jd.163.org/10.196.64.218:38939
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
......
......
2019-04-26 12:18:50,296 [26156] - INFO
[shuffle-client-4-1:RetryingBlockFetcher@164] - Retrying fetch (1/30) for 1
outstanding blocks after 20000 ms
```
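The retry line above implies a long worst-case stall: taking the values visible in the log (`(1/30)` retries, `20000 ms` between attempts, which in this setup come from the shuffle retry configuration), a fetch from a permanently dead executor can block for roughly ten minutes before failing. A back-of-envelope sketch of that arithmetic:

```scala
// Worst-case time spent in retry waits when every retry is doomed to fail,
// using the values from the log above: 30 retries, 20000 ms between attempts.
object RetryCostSketch {
  def worstCaseWaitMs(maxRetries: Int, retryWaitMs: Long): Long =
    maxRetries * retryWaitMs

  def main(args: Array[String]): Unit = {
    val ms = worstCaseWaitMs(30, 20000L)
    // 600000 ms = 600 s, i.e. about 10 minutes of pointless waiting
    println(s"worst-case wait: ${ms / 1000} s")
  }
}
```

An up-front executor-alive check avoids paying this cost when the target executor is already known to be removed.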
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]