zhouyejoe commented on a change in pull request #33078:
URL: https://github.com/apache/spark/pull/33078#discussion_r670987273
##########
File path:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -778,11 +773,6 @@ public void onComplete(String streamId) throws IOException
{
// IOException to the client. This may increase the chunk size
however the increase is
// still limited because of the limit on the number of
IOExceptions for a
// particular shuffle partition.
- } catch (NullPointerException e) {
- throw new RuntimeException(
- String.format("The merged shuffle partition info for appId %s
shuffleId %s "
- + "reduceId %s has been cleaned up", partitionInfo.appId,
- partitionInfo.shuffleId, partitionInfo.reduceId));
Review comment:
> I agree that this way we'd avoid NPE reference but just wonder it
can't tell the NPE reason explicitly. NPE can not only cause by
`AppShufflePartitionInfo` cleanup but also by potential buggy code. In the way
you proposed, I can't see how it could distinguish these two cases.
>
> How about we add an atomic flag to indicate `AppShufflePartitionInfo`
whether it's cleaned up or not? Then, when never we try to write the partition,
we'd check the flag first.
Setting a `AtomicBoolean` cleanup flag won't work and it would still throw
NPE in some cases here. Suppose Thread 1 is doing the `AtomicBoolean.get`, and
it returns false, which indicates the cleanup is not yet done, then it should
continue to run the dataChannel.flush(). However, context switch happens at
this time and Thread 2 the cleanup thread set the `AtomicBoolean` to true, and
`dataChannel` to `null`. Then context switch back to Thread 1,
`dataChannel.flush()` will still hit NPE.
Currently there is no other codes set the `dataChannel`, `indexFile` and
`metaFile` to null. And they are created from the file paths which is not
expected to be null either. If we really want to distinguish the NPE from the
cleanup or other buggy code before cleanup, we can still follow suggestions
from @mridulm and also add the AtomicBoolean flag to check whether the
dataChannel null is caused by cleanup or not. It will only throw
IllegalStatement exception with error message indicating cleanup has been done,
if the null is caused by cleanup. Otherwise, NPE will still be thrown out which
indicates something else is buggy.
Thoughts? @Ngone51
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]