sjrand edited a comment on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler
URL: https://github.com/apache/spark/pull/24645#issuecomment-495040379

On the client (executor) side we were seeing lots of timeouts, e.g.:
```
ERROR [2019-05-16T18:34:57.782Z] org.apache.spark.storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to <node_manager_hostname>/<ip>:7337
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:250)
	at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:206)
	at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
	at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:300)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:297)
	at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:271)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:92)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:222)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: <node_manager_hostname>/<ip>:7337
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
```

And in the NodeManager logs we were seeing lots of `ClosedChannelException` errors from Netty, along with the occasional `java.io.IOException: Broken pipe` error. For example:
```
2019-05-16 05:13:17,999 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1647907385644, chunkIndex=22}, buffer=FileSegmentManagedBuffer{file=/scratch/hadoop/tmp/nm-local-dir/usercache/<user_name>/appcache/application_1557300039674_635976/blockmgr-0ec1d292-3e75-40bd-afd3-79314f427338/11/shuffle_5_3900_0.data, offset=12387017, length=1235}} to /<ip_addr>:35922; closing connection
java.nio.channels.ClosedChannelException
```

We confirmed that the `shuffle-server` threads were still alive in the NM and took thread dumps, but we weren't able to determine what the issue was. In the end we restarted the NodeManagers, which fixed the problem. I didn't create a JIRA for this, simply because I don't think the information I have so far is enough to be actionable.
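For context on how an exception counter like the one this PR adds would have helped here, a minimal sketch of the idea using Netty and Dropwizard Metrics (both already used by the shuffle service). The handler class and metric name below are hypothetical illustrations, not the actual patch:
```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;

// Hypothetical handler sketching the PR's idea: count exceptions seen by the
// shuffle server so operators can alert on a rising counter instead of
// grepping NodeManager logs for ClosedChannelException / Broken pipe.
public class ExceptionCountingHandler extends ChannelDuplexHandler {

  private final Counter caughtExceptions;

  public ExceptionCountingHandler(MetricRegistry registry) {
    // "shuffle-server.caughtExceptions" is an assumed metric name, not
    // necessarily the one the real patch registers.
    this.caughtExceptions = registry.counter("shuffle-server.caughtExceptions");
  }

  @Override
  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    // Record the failure before delegating, so errors like the
    // ClosedChannelException above show up in the NM metrics.
    caughtExceptions.inc();
    ctx.fireExceptionCaught(cause);
  }
}
```
With something like this exported through the NodeManager's metrics system, a sustained spike in the counter would have flagged the sick shuffle servers before we resorted to thread dumps and restarts.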
