otterc opened a new pull request #31934:
URL: https://github.com/apache/spark/pull/31934


   
   ### What changes were proposed in this pull request?
   This PR fixes bugs that cause corruption of push-merged blocks when a client terminates while pushing a block. `RemoteBlockPushResolver` was introduced in #30062.
   
   There are 2 scenarios where the merged blocks get corrupted:
   1. `StreamCallback.onFailure()` is called more than once. Initially we assumed that the `onFailure` callback would be called just once per stream. However, we observed that it is called twice when a client connection is reset. When the client connection is reset, two events are triggered in this order:
    - `exceptionCaught`. This event is propagated to `StreamInterceptor`. `StreamInterceptor.exceptionCaught()` invokes `callback.onFailure(streamId, cause)`. This is the first time `StreamCallback.onFailure()` is invoked.
    - `channelInactive`. Since the channel closes, the `channelInactive` event is triggered, which is again propagated to `StreamInterceptor`. `StreamInterceptor.channelInactive()` invokes `callback.onFailure(streamId, new ClosedChannelException())`. This is the second time `StreamCallback.onFailure()` is invoked.
   
   2. The flag `isWriting` is set to `true` prematurely. This introduces an edge case where a stream that is trying to merge a duplicate block (created by a speculative task) may interfere with an active stream if the duplicate stream fails. A minimal illustrative sketch of guards against both failure scenarios follows this list.
   
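   The sketch below is illustrative only: the class and field names (`PushBlockStreamCallback`, `failureHandled`) are hypothetical stand-ins, not the actual `RemoteBlockPushResolver` code. It shows the two guards described above: making the failure handling idempotent so a second `onFailure()` invocation is a no-op, and flipping `isWriting` only after bytes have actually been written.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a per-stream callback, not the real implementation.
class PushBlockStreamCallback {
  // Ensures failure handling runs at most once, even when the callback is
  // invoked twice (exceptionCaught followed by channelInactive).
  private final AtomicBoolean failureHandled = new AtomicBoolean(false);
  // Set only after this stream has actually written bytes to the merged file,
  // so a failing duplicate stream cannot disturb the state of an active one.
  private boolean isWriting = false;

  public void onData(String streamId, ByteBuffer buf) {
    // ... write buf to the merged shuffle file at the tracked offset ...
    isWriting = true;  // flip only once a write has really happened
  }

  public void onFailure(String streamId, Throwable cause) {
    if (!failureHandled.compareAndSet(false, true)) {
      return;  // second invocation (e.g. ClosedChannelException) is ignored
    }
    if (isWriting) {
      // ... record that this partition has a failed, partially written block ...
    }
    // ... release resources held by this stream ...
  }
}
```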
   This PR also includes additional changes that improve the code:
   1. Positional writes are now used all the time, because this simplifies the code; microbenchmarking showed no measurable performance impact. A generic sketch of a positional write follows below.
   2. Additional minor changes suggested by @mridulm during an internal review.
   
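   For context on item 1, here is a minimal, generic illustration of a positional write using `java.nio.channels.FileChannel`. This is not the resolver's actual write path, and the file name is made up: the point is that the offset is passed explicitly on every write, so the caller's bookkeeping is the single source of truth and the channel's mutable position is never relied upon.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PositionalWriteExample {
  public static void main(String[] args) throws IOException {
    try (FileChannel channel = FileChannel.open(
        Paths.get("merged-shuffle-part.data"),  // hypothetical file name
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      ByteBuffer buf = ByteBuffer.wrap("block-bytes".getBytes());
      long position = 1024;  // offset tracked by the caller, not by the channel
      // Positional write: FileChannel.write(buf, position) leaves the
      // channel's own position untouched.
      while (buf.hasRemaining()) {
        position += channel.write(buf, position);
      }
    }
  }
}
```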
   ### Why are the changes needed?
   These are bug fixes and simplify the code.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added unit tests. I have also tested these changes in LinkedIn's internal fork on a cluster.
   
   Co-authored-by: Chandni Singh [email protected]
   Co-authored-by: Min Shen [email protected]
   
   
   

