gerashegalov opened a new pull request #31540:
URL: https://github.com/apache/spark/pull/31540


   This PR is a fix for the JLS 17.5.3 violation identified in
   @zsxwing's 19/Feb/19 11:47 comment on the JIRA.
   
   ### What changes were proposed in this pull request?
   - Use a `var` field, rather than a final field, to hold the state of the collection accumulator.
   
   ### Why are the changes needed?
   `AccumulatorV2` auto-registers the accumulator during `readObject`, which does not
work with final fields that are post-processed outside `readObject`. As it stands,
incompletely initialized objects are published to the heartbeat thread. This leads
to sporadic exceptions that knock out executors, which increases the cost of the
jobs. We observe such failures multiple times every hour.
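
   The publication window described above can be illustrated with a minimal,
deterministic Java sketch. It is not Spark's code: `Base`, `FinalFieldAcc`,
`LazyFieldAcc`, and `PublicationDemo` are hypothetical names, and a recorded
`describe()` call stands in for the heartbeat thread observing the object. A
non-final Java field plays the role of a Scala `var`.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a base class that registers (publishes) itself
// from readObject, before subclass fields have been restored.
abstract class Base implements Serializable {
    // Records what an observer would see at registration time.
    static final List<String> observedAtRegistration = new ArrayList<>();

    abstract String describe();

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Subclass fields are still unset here: deserialization restores the
        // class hierarchy top-down and does not run field initializers.
        observedAtRegistration.add(describe());
    }
}

// Problematic shape: state lives in a final field that is assumed non-null.
class FinalFieldAcc extends Base {
    private final List<String> items = new ArrayList<>();

    void add(String s) { items.add(s); }

    @Override
    String describe() { return items == null ? "null" : "size=" + items.size(); }
}

// Sketch of the alternative: a non-final field behind a null-tolerant,
// synchronized accessor (an assumption about one way to use a var field).
class LazyFieldAcc extends Base {
    private List<String> items;

    private synchronized List<String> getOrCreate() {
        if (items == null) items = new ArrayList<>();
        return items;
    }

    synchronized void add(String s) { getOrCreate().add(s); }

    @Override
    synchronized String describe() { return "size=" + getOrCreate().size(); }
}

class PublicationDemo {
    // Serializes and deserializes an object in memory.
    static <T extends Serializable> T roundTrip(T obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        FinalFieldAcc a = new FinalFieldAcc();
        a.add("x");
        FinalFieldAcc a2 = roundTrip(a);

        LazyFieldAcc b = new LazyFieldAcc();
        b.add("x");
        LazyFieldAcc b2 = roundTrip(b);

        System.out.println(Base.observedAtRegistration);
        System.out.println(a2.describe() + " " + b2.describe());
    }
}
```

   In this sketch the final-field variant is observed with a `null` list at
registration time, while the non-final variant is observed as an empty list;
both hold the restored element once deserialization completes. The sketch only
shows that the observer no longer sees `null`; it does not establish full
thread safety for a real accumulator.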
   
   ### Does this PR introduce _any_ user-facing change?
   None
   
   ### How was this patch tested?
   - This is a concurrency bug that is almost impossible to reproduce in a
quick unit test.
   - By trial and error, I crafted a command
(https://github.com/NVIDIA/spark-rapids/pull/1688) that reproduces the issue on
my dev box several times per hour, with the first occurrence often within a few
minutes.
   - Existing unit tests in `*AccumulatorV2Suite` and `*LiveEntitySuite`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


