[
https://issues.apache.org/jira/browse/SPARK-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-23981.
----------------------------------
Resolution: Incomplete
> ShuffleBlockFetcherIterator - Spamming Logs
> -------------------------------------------
>
> Key: SPARK-23981
> URL: https://issues.apache.org/jira/browse/SPARK-23981
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.3.0
> Reporter: David Mollitor
> Priority: Major
> Labels: bulk-closed
>
> If the shuffle service on a remote host fails, Spark executors produce a huge
> volume of log output.
> {code:java}
> 2018-04-10 20:24:44,834 INFO [Block Fetch Retry-1] shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms
> 2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to get block(s) from myhost.local:7337
> java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>     at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused: myhost.local/12.13.14.15:7337
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     ... 1 more
> {code}
>
> The code shows that when a block fetch fails, a "listener" is notified once
> for each block. The log above shows that 1753 blocks were being fetched.
> Because the remote host has become unavailable, every fetch fails and an
> error is logged for every block.
>
> {code:java|title=RetryingBlockFetcher.java}
> if (shouldRetry(e)) {
>   initiateRetry();
> } else {
>   for (String bid : blockIdsToFetch) {
>     listener.onBlockFetchFailure(bid, e);
>   }
> }
> {code}
> {code:java|title=ShuffleBlockFetcherIterator.scala}
> override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
>   logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
>   results.put(new FailureFetchResult(BlockId(blockId), address, e))
> }
> {code}
> The result is 1753 ERROR stack traces in the log, all printing the same
> message:
> {quote}Failed to get block(s) from myhost.local:7337
> ...
> {quote}
> Perhaps it would be better to change the signature of {{onBlockFetchFailure}}
> so that it accepts an entire Collection of block IDs instead of being called
> once per block.
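
The batch signature suggested in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: `BatchBlockFetchListener`, `RecordingListener`, and `failAll` are hypothetical names invented for this example and are not part of Spark. The listener logs once per failed batch while still recording a failure for every block, so downstream handling can stay per-block.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class BatchFailureSketch {

  /** Hypothetical batch-style listener; NOT Spark's actual listener interface. */
  interface BatchBlockFetchListener {
    void onBlockFetchFailure(Collection<String> blockIds, Throwable cause);
  }

  /** Hypothetical listener that records what would be logged and enqueued. */
  static class RecordingListener implements BatchBlockFetchListener {
    final List<String> logLines = new ArrayList<>();
    final List<String> failedBlocks = new ArrayList<>();
    private final String host;
    private final int port;

    RecordingListener(String host, int port) {
      this.host = host;
      this.port = port;
    }

    @Override
    public void onBlockFetchFailure(Collection<String> blockIds, Throwable cause) {
      // One log entry for the whole batch instead of one per block.
      logLines.add("Failed to get " + blockIds.size() + " block(s) from "
          + host + ":" + port + ": " + cause.getMessage());
      // Each block is still recorded as failed individually.
      failedBlocks.addAll(blockIds);
    }
  }

  /** Stand-in for the non-retry branch: a single callback for all blocks. */
  static void failAll(String[] blockIdsToFetch, Throwable e,
                      BatchBlockFetchListener listener) {
    listener.onBlockFetchFailure(Arrays.asList(blockIdsToFetch), e);
  }

  public static void main(String[] args) {
    // 1753 made-up block IDs, matching the count in the log above.
    String[] blockIds = new String[1753];
    for (int i = 0; i < blockIds.length; i++) {
      blockIds[i] = "shuffle_0_" + i + "_0";
    }
    RecordingListener listener = new RecordingListener("myhost.local", 7337);
    failAll(blockIds, new IOException("Failed to connect to myhost.local:7337"), listener);
    // prints "1 log line(s), 1753 failed block(s)"
    System.out.println(listener.logLines.size() + " log line(s), "
        + listener.failedBlocks.size() + " failed block(s)");
  }
}
```

A change like this would touch every implementer of the listener interface, so an alternative with a smaller footprint might be to keep the per-block callback and suppress the repeated log lines on the receiving side; which trade-off fits Spark better is not settled by this issue.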