otterc commented on a change in pull request #30433:
URL: https://github.com/apache/spark/pull/30433#discussion_r542886903



##########
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -586,6 +592,7 @@ public void onData(String streamId, ByteBuffer buf) throws 
IOException {
             deferredBufs = null;
             return;
           }
+          abortIfNecessary();

Review comment:
       The client does receive the exception thrown from `onData`. I simulated 
an exception from `onData` at the server for a particular push block.
   Below are the logs of the client. 
   Note: I am running an older version of magnet which still uses 
`ShuffleBlockId` for shuffle push and some other classes are old.
   
   ```
   20/12/11 19:14:48 INFO TransportClientFactory: Successfully created 
connection to ltx1-hcl3213.grid.linkedin.com/10.150.24.33:7337 after 12 ms (5 
ms spent in bootstraps)
   20/12/11 19:14:56 ERROR RetryingBlockFetcher: Failed to fetch block 
shuffle_1_7_7, and will not retry (1 retries)
   org.apache.spark.network.shuffle.BlockPushException: 
^H^@^@^@^_application_1602506816280_53624^@^@^@
   shuffle_1_7_7^@^@^@^Gjava.io.IOException: Destination failed while reading 
stream
           at 
org.apache.spark.network.server.TransportRequestHandler$3.onFailure(TransportRequestHandler.java:244)
           at 
org.apache.spark.network.client.StreamInterceptor.exceptionCaught(StreamInterceptor.java:58)
           at 
org.apache.spark.network.util.TransportFrameDecoder.exceptionCaught(TransportFrameDecoder.java:188)
           at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
           at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
           at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
           at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
           at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
           at 
org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
           at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
           at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
           at 
org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
           at 
org.spark_project.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
           at 
org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
           at 
org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
           at 
org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
           at 
org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
           at 
org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
           at 
org.spark_project.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.RuntimeException: FAILING this stream
   ```
   
   After this, the client logs show that the server has terminated the 
connection.
   ```
   20/12/11 19:14:57 WARN TransportChannelHandler: Exception in connection from 
ltx1-hcl3412.grid.linkedin.com/10.150.55.78:7337
   java.io.IOException: Connection reset by peer
           at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
           at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
           at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
           at sun.nio.ch.IOUtil.read(IOUtil.java:192)
           at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
           at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
           at 
io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
           at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
           at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
           at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
           at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
           at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
           at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
           at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
           at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
           at java.lang.Thread.run(Thread.java:748)
   20/12/11 19:14:57 ERROR TransportResponseHandler: Still have 38 requests 
outstanding when connection from 
ltx1-hcl3412.grid.linkedin.com/10.150.55.78:7337 is closed
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to