zhouyejoe commented on a change in pull request #33078:
URL: https://github.com/apache/spark/pull/33078#discussion_r670987273



##########
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -778,11 +773,6 @@ public void onComplete(String streamId) throws IOException 
{
               // IOException to the client. This may increase the chunk size 
however the increase is
               // still limited because of the limit on the number of 
IOExceptions for a
               // particular shuffle partition.
-            } catch (NullPointerException e) {
-              throw new RuntimeException(
-                String.format("The merged shuffle partition info for appId %s 
shuffleId %s "
-                  + "reduceId %s has been cleaned up", partitionInfo.appId,
-                  partitionInfo.shuffleId, partitionInfo.reduceId));

Review comment:
       > I agree that this way we'd avoid NPE reference but just wonder it 
can't tell the NPE reason explicitly. NPE can not only cause by 
`AppShufflePartitionInfo` cleanup but also by potential buggy code. In the way 
you proposed, I can't see how it could distinguish these two cases.
   > 
   > How about we add an atomic flag to indicate `AppShufflePartitionInfo` 
whether it's cleaned up or not? Then, when never we try to write the partition, 
we'd check the flag first.
   
   Setting a `AtomicBoolean` cleanup flag won't work and it would still throw 
NPE in some cases here. Suppose Thread 1 is doing the `AtomicBoolean.get`, and 
it returns false, which indicates the cleanup is not yet done, then it should 
continue to run the dataChannel.flush(). However, context switch happens at 
this time and Thread 2 the cleanup thread set the `AtomicBoolean` to true, and 
`dataChannel` to `null`. Then context switch back to Thread 1, 
`dataChannel.flush()` will still hit NPE.
   
   Currently there is no other codes set the `dataChannel`, `indexFile` and 
`metaFile` to null. And they are created from the file paths which is not 
expected to be null either. If we really want to distinguish the NPE from the 
cleanup or other buggy code before cleanup, we can still follow suggestions 
from @mridulm and also add the AtomicBoolean flag to check whether the 
dataChannel null is caused by cleanup or not. It will only throw 
IllegalStatement exception with error message indicating cleanup has been done, 
if the null is caused by cleanup. Otherwise, NPE will still be thrown out which 
indicates something else is buggy.
   
   Thoughts? @Ngone51 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to