otterc commented on a change in pull request #30433:
URL: https://github.com/apache/spark/pull/30433#discussion_r542886903
##########
File path:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -586,6 +592,7 @@ public void onData(String streamId, ByteBuffer buf) throws
IOException {
deferredBufs = null;
return;
}
+ abortIfNecessary();
Review comment:
The client does receive the exception thrown from `onData`. I simulated
an exception from `onData` at the server for a particular push block.
Below are the logs of the client.
Note: I am running an older version of magnet which still uses
`ShuffleBlockId` for shuffle push and some other classes are old.
```
20/12/11 19:14:48 INFO TransportClientFactory: Successfully created
connection to ltx1-hcl3213.grid.linkedin.com/10.150.24.33:7337 after 12 ms (5
ms spent in bootstraps)
20/12/11 19:14:56 ERROR RetryingBlockFetcher: Failed to fetch block
shuffle_1_7_7, and will not retry (1 retries)
org.apache.spark.network.shuffle.BlockPushException:
^H^@^@^@^_application_1602506816280_53624^@^@^@
shuffle_1_7_7^@^@^@^Gjava.io.IOException: Destination failed while reading
stream
at
org.apache.spark.network.server.TransportRequestHandler$3.onFailure(TransportRequestHandler.java:244)
at
org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:58)
at
org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:188)
at
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
at
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
at
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
at
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at
org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at
org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
at
org.spark_project.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
at
org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at
org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at
org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at
org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at
org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at
org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: FAILING this stream
```
After this, the client logs show that the server has terminated the
connection.
```
20/12/11 19:14:57 WARN TransportChannelHandler: Exception in connection from
ltx1-hcl3412.grid.linkedin.com/10.150.55.78:7337
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
20/12/11 19:14:57 ERROR TransportResponseHandler: Still have 38 requests
outstanding when connection from
ltx1-hcl3412.grid.linkedin.com/10.150.55.78:7337 is closed
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]