venkateshwaracholan opened a new pull request, #1093:
URL: https://github.com/apache/yunikorn-core/pull/1093

   What is this PR for?
   
   Fix a data race in EventSystemImpl.AddEvent() during shutdown.
   
   Stop() closes and clears ec.channel while holding the event system mutex. 
AddEvent() previously accessed ec.channel without synchronization. During 
TestSchedulerRecoveryQuotaPreemption, scheduler goroutines could still emit 
events while StopAll() shut down the event system, resulting in a race detected 
by go test -race.
   
   The reported race was between:
   
   EventSystemImpl.AddEvent()
   EventSystemImpl.Stop() (close(ec.channel) / ec.channel = nil)
   
   Beyond the data race itself, the same interleaving could cause a "send on closed channel" panic.
   
   What type of PR is this?
   
   Bug Fix
   
   What is the Jira issue?
   
   https://issues.apache.org/jira/browse/YUNIKORN-3296
   
   How should this be tested?
   
   Run:
   
   go test -race ./pkg/events/...
   go test -race -count=20 -run TestSchedulerRecoveryQuotaPreemption ./pkg/scheduler/tests
   
   A new regression test, TestAddEventConcurrentStop, exercises concurrent 
calls to AddEvent() and Stop().
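   
   The shape of such a concurrency test can be sketched roughly as follows. This is not the code of TestAddEventConcurrentStop; `runConcurrently` and `fakeSystem` are hypothetical names made up for this illustration, and `fakeSystem` only counts calls instead of sending on a channel. The point is the pattern: release several producers at once, call stop while they are still running, and let the race detector watch the overlap.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fakeSystem is a hypothetical stand-in for the event system: it only
// records whether each Add call arrived before or after Stop.
type fakeSystem struct {
	mu       sync.Mutex
	stopped  bool
	accepted int
	rejected int
}

func (f *fakeSystem) Add() {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.stopped {
		f.rejected++
		return
	}
	f.accepted++
}

func (f *fakeSystem) Stop() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.stopped = true
}

// runConcurrently (hypothetical helper) starts `workers` goroutines that
// each call add `perWorker` times, calls stop while they are running,
// waits for all of them, and returns the total number of add calls made.
func runConcurrently(workers, perWorker int, add, stop func()) int64 {
	var wg sync.WaitGroup
	var calls int64
	start := make(chan struct{})
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // release all producers at the same time
			for j := 0; j < perWorker; j++ {
				add()
				atomic.AddInt64(&calls, 1)
			}
		}()
	}
	close(start)
	stop() // overlaps with the producers above
	wg.Wait()
	return atomic.LoadInt64(&calls)
}

func main() {
	fs := &fakeSystem{}
	total := runConcurrently(8, 1000, fs.Add, fs.Stop)
	// Every call must be accounted for as either accepted or rejected.
	fmt.Println(total == int64(fs.accepted+fs.rejected)) // prints "true"
}
```

   Run under `go test -race` or `go run -race`, a variant of `Add` that skipped the mutex would be flagged by the race detector.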
   
   What does this PR do?
   
   Synchronize AddEvent() with Stop() using the same mutex. If the event system 
has already been stopped or the channel has been cleared, the event is not sent 
and is counted as not channeled.
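   
   A minimal sketch of this locking pattern, not the actual yunikorn-core code: `eventSystem`, its fields, and the `notChanneled` counter are simplified, hypothetical stand-ins for EventSystemImpl, and the event payload is reduced to a string. The non-blocking send is also an assumption, chosen so that the mutex is never held while waiting on a full channel.

```go
package main

import (
	"fmt"
	"sync"
)

// eventSystem is a hypothetical, simplified stand-in for EventSystemImpl.
type eventSystem struct {
	mu           sync.Mutex
	channel      chan string // set to nil by Stop
	stopped      bool
	notChanneled int // events that were dropped instead of sent
}

func newEventSystem(size int) *eventSystem {
	return &eventSystem{channel: make(chan string, size)}
}

// AddEvent takes the same mutex as Stop, so it can never observe a
// half-closed channel. It reports whether the event was channeled.
func (e *eventSystem) AddEvent(ev string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.stopped || e.channel == nil {
		e.notChanneled++
		return false
	}
	select {
	case e.channel <- ev:
		return true
	default: // buffer full: drop rather than block while holding the lock
		e.notChanneled++
		return false
	}
}

// Stop closes and clears the channel under the mutex. Calling it twice is safe.
func (e *eventSystem) Stop() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.channel != nil {
		close(e.channel)
		e.channel = nil
	}
	e.stopped = true
}

// NotChanneled returns the number of dropped events.
func (e *eventSystem) NotChanneled() int {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.notChanneled
}

func main() {
	es := newEventSystem(1)
	fmt.Println(es.AddEvent("before stop")) // true
	es.Stop()
	fmt.Println(es.AddEvent("after stop")) // false
	fmt.Println(es.NotChanneled())         // 1
}
```

   Because the check and the send happen under the same lock that Stop() holds while closing the channel, a send can no longer land on a closed channel.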
   
   ### Questions:
   * [ ] - The license files need an update.
   * [ ] - There are breaking changes for older versions.
   * [ ] - It needs documentation.
   

