sjrand edited a comment on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler
URL: https://github.com/apache/spark/pull/24645#issuecomment-495040379
 
 
   On the client (executor) side we were seeing lots of timeouts, e.g.:
   
   ```
   ERROR [2019-05-16T18:34:57.782Z] org.apache.spark.storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
   java.io.IOException: Failed to connect to <node_manager_hostname>/<ip>:7337
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:250)
        at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:206)
        at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
        at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:300)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
        at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:297)
        at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:271)
        at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:92)
        at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
        at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:222)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: <node_manager_hostname>/<ip>:7337
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
   ```
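The "will retry 2 more times after waiting 5 seconds" message above comes from the `BlockManager.registerWithExternalShuffleServer` loop visible in the stack trace, which retries registration a fixed number of times with a sleep between failed attempts. A minimal, self-contained sketch of that retry-then-give-up pattern (the method name, signature, and constants here are illustrative, not Spark's actual API):

```java
import java.io.IOException;

public class RetryRegistration {

    /** A registration attempt that may fail with an IOException. */
    interface Registration {
        void run() throws IOException;
    }

    /**
     * Retries {@code register} up to {@code maxAttempts} times, sleeping
     * {@code sleepMillis} between failed attempts. On the final failure the
     * exception propagates, which is when executor startup fails as in the
     * log above.
     */
    static void registerWithRetries(Registration register, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                register.run();
                return; // registered successfully
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    throw e; // out of retries: propagate to the caller
                }
                System.err.printf(
                        "Failed to connect to external shuffle server, will retry %d more times after waiting %d seconds...%n",
                        maxAttempts - attempt, sleepMillis / 1000);
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

With 3 total attempts and a 5-second sleep, a transient outage shorter than ~10 seconds is survived, which matches the cadence of the warnings we observed.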
   
   And in the NodeManager logs we were seeing lots of `ClosedChannelException` errors from netty, along with the occasional `java.io.IOException: Broken pipe`. For example:
   
   ```
   2019-05-16 05:13:17,999 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1647907385644, chunkIndex=22}, buffer=FileSegmentManagedBuffer{file=/scratch/hadoop/tmp/nm-local-dir/usercache/<user_name>/appcache/application_1557300039674_635976/blockmgr-0ec1d292-3e75-40bd-afd3-79314f427338/11/shuffle_5_3900_0.data, offset=12387017, length=1235}} to /<ip_addr>:35922; closing connection
   java.nio.channels.ClosedChannelException
   ```
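Errors like the one above only show up in logs today, which is what makes this PR's metric useful: a counter of exceptions caught server-side would have made the spike visible on a dashboard. A minimal sketch of the idea, using a plain `AtomicLong` as a stand-in for the metrics counter (the class and field names here are illustrative, not the PR's actual code):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ExceptionCountingHandler {

    // Stand-in for a metrics-registry counter; the field name is an
    // assumption for this sketch.
    static final AtomicLong caughtExceptions = new AtomicLong();

    /**
     * Runs one request-handling step, counting any exception before
     * re-throwing it so the caller can still log it and close the channel.
     */
    static void handle(Runnable requestHandler) {
        try {
            requestHandler.run();
        } catch (RuntimeException e) {
            caughtExceptions.incrementAndGet(); // exported as a metric, conceptually
            throw e;
        }
    }
}
```

Counting before re-throwing keeps the existing log-and-close behavior intact while making the failure rate observable over time.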
   
   We confirmed that the `shuffle-server` threads in the NodeManager were still alive and took thread dumps, but we weren't able to determine the root cause. In the end we restarted the NodeManagers, which fixed the problem.
   
   I didn't create a JIRA for this because I don't think the information I have so far is enough to be actionable.
