vanzin opened a new pull request #26535: [SPARK-29905][k8s] Improve pod 
lifecycle manager behavior with dynamic allocation.
URL: https://github.com/apache/spark/pull/26535
 
 
   This issue mainly shows up when dynamic allocation is enabled.
   Executors are constantly requested, started, and later stopped, so
   there are many executor state changes, and the different events cause
   the same executor update to be present in multiple pod snapshots. As
   a result, the lifecycle manager class could end up logging information
   about the same executor multiple times. On top of that, it could end
   up making multiple redundant calls into the API server for the same
   pod.
   
   Another issue occurred when the config was set to not delete executor
   pods. With dynamic allocation, pods then keep accumulating in the API
   server, and every time the polling source does a full sync, all
   executors would be processed, even the finished ones that Spark
   technically does not care about anymore.
   
   The change modifies the lifecycle manager so that it:
   
   - logs each executor update a single time, even if it shows up in
     multiple snapshots, by checking whether the state change
     was already processed.
   - marks finished-but-not-deleted-in-k8s executors with a label
     so that they can be easily filtered out.
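
   The "log a single time" behavior can be pictured with a minimal
   sketch. This is not the code from the PR: the function name, the
   dict-based snapshots, and the state strings are all hypothetical,
   chosen only to show how remembering the last processed state per
   executor suppresses repeats across snapshots.

```python
from typing import Dict, Iterable, List


def process_snapshots(
    snapshots: Iterable[Dict[int, str]], seen: Dict[int, str]
) -> List[str]:
    """Return one log line per genuine executor state change.

    `snapshots` maps executor id -> state (hypothetical representation).
    `seen` remembers the last state already processed for each executor.
    """
    log_lines = []
    for snapshot in snapshots:
        for exec_id, state in sorted(snapshot.items()):
            if seen.get(exec_id) == state:
                # Same update already handled in an earlier snapshot.
                continue
            seen[exec_id] = state
            log_lines.append(f"executor {exec_id} is now {state}")
    return log_lines


# Executor 1's "running" and "succeeded" updates each appear in
# more than one snapshot, but are reported only once.
snapshots = [
    {1: "running"},
    {1: "running", 2: "running"},
    {1: "succeeded", 2: "running"},
    {1: "succeeded", 2: "running"},
]
lines = process_snapshots(snapshots, seen={})
```

   The same check is also the natural place to skip a redundant API
   server call, since the call would be tied to the state change
   rather than to each snapshot that carries it.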
   
   This reduces the amount of logging done by the lifecycle manager,
   which is a minor thing in general since the logs are at debug level.
   But it also reduces the amount of data that needs to be fetched
   from the API server under certain configurations, and overall
   reduces interaction with the API server when dynamic allocation is on.
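
   As a rough illustration of the labeling idea, the sketch below
   models pods as plain dicts and filters in memory. The label key
   and helper names are hypothetical, not the ones used by the PR.
   Against a real API server the equivalent filter would be a label
   selector of the form `key!=value`, so finished pods would not be
   fetched at all.

```python
INACTIVE_LABEL = "example-exec-inactive"  # hypothetical label key


def mark_inactive(pod: dict) -> None:
    """Tag a finished executor pod that was left in the cluster."""
    pod["labels"][INACTIVE_LABEL] = "true"


def full_sync(pods: list) -> list:
    """Return only the pods a full sync still needs to process."""
    return [p for p in pods if p["labels"].get(INACTIVE_LABEL) != "true"]


pods = [
    {"name": f"exec-{i}", "phase": phase, "labels": {}}
    for i, phase in enumerate(["Running", "Succeeded", "Failed"], start=1)
]

# Finished executors are labeled once, when their final state is handled.
for pod in pods:
    if pod["phase"] in ("Succeeded", "Failed"):
        mark_inactive(pod)

active = [p["name"] for p in full_sync(pods)]
```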
   
   There's also a change in the snapshot store to ensure that the
   same subscriber is not called concurrently. Concurrent calls are kind
   of a bug, since they mean subscribers could be processing snapshots
   out of order, or even that they could block multiple threads (e.g.
   the allocator callback was synchronized). I actually ran into the
   "concurrent calls" situation in the lifecycle manager during testing,
   and while it did not seem to cause problems, it did make for some
   head scratching while looking at the logs. It seemed safer to fix
   that.
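
   One way to get the "never called concurrently" guarantee is to give
   each subscriber its own buffer and let a single thread at a time
   drain it. The sketch below is an assumption about how this could be
   done, not a description of the snapshot store in the PR; the class
   and method names are hypothetical.

```python
import threading
from collections import deque


class SerializedSubscriber:
    """Delivers snapshots to one callback, one at a time, in arrival order."""

    def __init__(self, callback):
        self._callback = callback
        self._buffer = deque()
        self._lock = threading.Lock()
        self._draining = False

    def deliver(self, snapshot):
        with self._lock:
            self._buffer.append(snapshot)
            if self._draining:
                # Another thread is already in the callback; it will
                # pick this snapshot up, so this caller does not block.
                return
            self._draining = True
        while True:
            with self._lock:
                if not self._buffer:
                    self._draining = False
                    return
                item = self._buffer.popleft()
            try:
                self._callback(item)
            except Exception:
                # Release the drain flag so later deliveries still run.
                with self._lock:
                    self._draining = False
                raise


# Demo: four threads deliver concurrently; the callback records how
# many invocations were ever in flight at the same time.
received = []
stats = {"active": 0, "max_active": 0}
guard = threading.Lock()


def callback(snapshot):
    with guard:
        stats["active"] += 1
        stats["max_active"] = max(stats["max_active"], stats["active"])
    received.append(snapshot)
    with guard:
        stats["active"] -= 1


subscriber = SerializedSubscriber(callback)


def producer(base):
    for i in range(100):
        subscriber.deliver(base * 100 + i)


threads = [threading.Thread(target=producer, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

   With this shape, a slow callback delays only its own subscriber's
   queue, and delivering threads return immediately instead of piling
   up behind a synchronized callback.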
   
   Unit tests were updated to check for the changes. Also tested on a
   real cluster with dynamic allocation on.
