agrawaldevesh opened a new pull request #29226:
URL: https://github.com/apache/spark/pull/29226


   ### What changes were proposed in this pull request?
   
   It's possible for this test to schedule the 3 tasks on just 2 out of 3
   executors and then end up decommissioning the third one. Since the third
   executor does not have any blocks, none get replicated, thereby failing
   the test.
   
   This condition can be triggered on a loaded system, where Worker
   registrations are sometimes delayed. It triggers often enough on GitHub
   checks to be worth fixing.
   
   I added some logging to help diagnose the problem, but the fix itself is
   simple: require that each executor takes up its entire worker, so that
   tasks cannot share executors. A rough sketch of the idea is shown below.
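   
   As a hedged illustration only (the names, numbers, and settings here are
   assumptions for the sketch, not the suite's exact code), one way to realize
   the "no sharing" idea is to give every worker exactly one executor's worth of
   resources and make every task claim a whole executor, so the 3 tasks cannot
   pile onto 2 executors:
   
   ```scala
   import org.apache.spark.{SparkConf, SparkContext}
   
   // Minimal, self-contained sketch of the scheduling layout; illustrative only.
   object NoExecutorSharingSketch {
     def main(args: Array[String]): Unit = {
       val conf = new SparkConf()
         .setAppName("block-manager-decommission-sketch")
         // 3 workers, 1 core and 1024 MB each: one single-core executor per worker.
         .setMaster("local-cluster[3, 1, 1024]")
         // Each task claims the executor's only core, so while the tasks run
         // no executor can host more than one of them.
         .set("spark.task.cpus", "1")
   
       val sc = new SparkContext(conf)
       try {
         // 3 partitions -> 3 tasks, ideally one per executor, so every executor
         // ends up holding at least one cached block before decommissioning.
         val rdd = sc.parallelize(1 to 30, numSlices = 3).cache()
         rdd.count()
       } finally {
         sc.stop()
       }
     }
   }
   ```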
   
   ### Why are the changes needed?
   
   BlockManagerDecommissionIntegrationSuite has been flaky and has failed 
several times on my PRs, so this fixes it.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, this is a unit-test-only change.
   
   ### How was this patch tested?
   
   GitHub checks. I also ran this test 100 times, roughly 10 at a time in 
parallel (to create enough load), and verified that under the same regime the 
test fails at least once without this fix.

