otterc commented on a change in pull request #33613:
URL: https://github.com/apache/spark/pull/33613#discussion_r682876739



##########
File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -471,9 +488,10 @@ public void onData(String streamId, ByteBuffer buf) {
         public void onComplete(String streamId) {
           if (isStaleBlockOrTooLate) {
            // Throw an exception here so the block data is drained from channel and server
-            // responds RpcFailure to the client.
-            throw new RuntimeException(String.format("Block %s %s", streamId,
-              ErrorHandler.BlockPushErrorHandler.TOO_LATE_OR_STALE_BLOCK_PUSH_MESSAGE_SUFFIX));
+            // responds the error code to the client.
+            throw new BlockPushNonFatalFailure(
+              new PushBlockNonFatalErrorCode(ErrorCode.TOO_LATE_OR_STALE_BLOCK_PUSH.id())

Review comment:
       I saw how it was being used. My point is that this is not needed any more. We can use `BlockPushNonFatalFailure` instead of `StaleBlockPushException` where it is instantiated.
   
   Also, catching the `StaleBlockPushException` is not really required. I am pointing to the lines below. If `getOrCreateAppShufflePartitionInfo` throws a `RuntimeException`, then `receiveBlockDataAsStream` would just fail. In the case of a stale push, there is no real need to return a `StreamCallback`.
   ```
       try {
         partitionInfoBeforeCheck = getOrCreateAppShufflePartitionInfo(appShuffleInfo,
           msg.shuffleId, msg.shuffleMergeId, msg.reduceId);
       } catch (StaleBlockPushException sbp) {
         // Set partitionInfoBeforeCheck to null so that the stale block push gets handled.
         partitionInfoBeforeCheck = null;
       }
    ```
    So, this could just be simplified to:
    ```
        AppShufflePartitionInfo partitionInfoBeforeCheck =
          getOrCreateAppShufflePartitionInfo(appShuffleInfo, msg.shuffleId,
            msg.shuffleMergeId, msg.reduceId);
    ```
    If the execution goes beyond this point, then the push was not stale, which means it is just too late.
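   The simplification above could look like the following minimal sketch. All class and method names here are illustrative stand-ins for the actual Spark internals, not the real API:
   ```java
   // Hypothetical sketch: an unchecked, non-fatal failure type lets the
   // caller drop the try/catch entirely and let a stale push propagate.
   public class PushFlowSketch {

     // Stand-in for the proposed BlockPushNonFatalFailure exception type.
     static class BlockPushNonFatalFailure extends RuntimeException {
       BlockPushNonFatalFailure(String msg) {
         super(msg);
       }
     }

     // Stand-in for getOrCreateAppShufflePartitionInfo: it throws on a stale
     // push instead of forcing the caller to translate that into a null.
     static String getOrCreatePartitionInfo(boolean stale) {
       if (stale) {
         throw new BlockPushNonFatalFailure("stale block push");
       }
       return "partitionInfo";
     }

     // The caller no longer needs a try/catch; a stale push simply
     // propagates and fails receiveBlockDataAsStream, which is the
     // desired behavior.
     static String receiveBlockDataAsStream(boolean stale) {
       String partitionInfoBeforeCheck = getOrCreatePartitionInfo(stale);
       return partitionInfoBeforeCheck;
     }

     public static void main(String[] args) {
       System.out.println(receiveBlockDataAsStream(false));
       try {
         receiveBlockDataAsStream(true);
       } catch (BlockPushNonFatalFailure e) {
         System.out.println("failed: " + e.getMessage());
       }
     }
   }
   ```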

##########
File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -471,9 +488,10 @@ public void onData(String streamId, ByteBuffer buf) {
         public void onComplete(String streamId) {
           if (isStaleBlockOrTooLate) {
            // Throw an exception here so the block data is drained from channel and server
-            // responds RpcFailure to the client.
-            throw new RuntimeException(String.format("Block %s %s", streamId,
-              ErrorHandler.BlockPushErrorHandler.TOO_LATE_OR_STALE_BLOCK_PUSH_MESSAGE_SUFFIX));
+            // responds the error code to the client.
+            throw new BlockPushNonFatalFailure(
+              new PushBlockNonFatalErrorCode(ErrorCode.TOO_LATE_OR_STALE_BLOCK_PUSH.id())

Review comment:
       @Victsm that's a good point. But we can still replace the usage of `StaleBlockPushException`, which is created with a message that uses `String.format`, with `BlockPushNonFatalFailure`. We can still catch `BlockPushNonFatalFailure` and, based on the `ErrorCode` enum, differentiate between stale and too-late pushes.
   
   Also, for stale block pushes that are for prior application attempts, isn't closing the channel better? For clients that are pushing blocks from a prior application attempt, we just want them to stop pushing altogether, correct?
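   The error-code differentiation could look like this minimal sketch. The enum values and class shape here are illustrative stand-ins, not the actual Spark definitions:
   ```java
   // Hypothetical sketch: one exception type carrying an error-code enum,
   // so a single catch site can branch on the code instead of on
   // exception types.
   public class ErrorCodeSketch {

     enum ErrorCode { TOO_LATE_BLOCK_PUSH, STALE_BLOCK_PUSH }

     static class BlockPushNonFatalFailure extends RuntimeException {
       final ErrorCode code;
       BlockPushNonFatalFailure(ErrorCode code, String msg) {
         super(msg);
         this.code = code;
       }
     }

     // One catch site can differentiate stale vs. too-late pushes here.
     static String classify(BlockPushNonFatalFailure failure) {
       switch (failure.code) {
         case STALE_BLOCK_PUSH:
           return "stale";    // e.g. push with an older shuffle merge id
         case TOO_LATE_BLOCK_PUSH:
           return "too late"; // e.g. push arrived after finalization
         default:
           return "unknown";
       }
     }

     public static void main(String[] args) {
       System.out.println(classify(new BlockPushNonFatalFailure(
         ErrorCode.STALE_BLOCK_PUSH, "stale push")));
     }
   }
   ```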

##########
File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -471,9 +488,10 @@ public void onData(String streamId, ByteBuffer buf) {
         public void onComplete(String streamId) {
           if (isStaleBlockOrTooLate) {
            // Throw an exception here so the block data is drained from channel and server
-            // responds RpcFailure to the client.
-            throw new RuntimeException(String.format("Block %s %s", streamId,
-              ErrorHandler.BlockPushErrorHandler.TOO_LATE_OR_STALE_BLOCK_PUSH_MESSAGE_SUFFIX));
+            // responds the error code to the client.
+            throw new BlockPushNonFatalFailure(
+              new PushBlockNonFatalErrorCode(ErrorCode.TOO_LATE_OR_STALE_BLOCK_PUSH.id())

Review comment:
       > For stale push from prior app attempt, it currently indeed throws an IllegalArgumentException inside receiveBlockDataAsStream which would lead to server closing channels.
   
   I see. That's right. I missed that "stale push" here refers only to executors from the same app attempt but with an older shuffle merge id.
   
   But anyway, I still think we should replace the `StaleBlockPushException` usage with `BlockPushNonFatalFailure`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
