TakawaAkirayo opened a new pull request, #45364:
URL: https://github.com/apache/spark/pull/45364
### What changes were proposed in this pull request?
Add a new config `spark.scheduler.listenerbus.eventqueue.waitForEventDispatchExitOnStop` (default: `true`).

Before this PR: on stop, each event queue always waits for its remaining events to be completely drained.
After this PR: users can control this behavior (wait for the events to be completely drained or not) via the Spark config.
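For example, a service could opt out of the drain on stop like this (a minimal sketch; the builder API is standard Spark, and the config key is the one proposed in this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fast-stop-sketch")
  // Opt out of waiting for listeners to drain the remaining events on stop.
  .config("spark.scheduler.listenerbus.eventqueue.waitForEventDispatchExitOnStop", "false")
  .getOrCreate()

// ... run jobs ...

spark.stop() // returns promptly even if listeners still have queued events
```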
### Why are the changes needed?
#### Problem statement
`SparkContext.stop()` can hang for a long time in `LiveListenerBus.stop()` when listeners are slow.
#### User scenarios
We have a centralized service with multiple instances that regularly executes users' scheduled tasks.
For each user task within one service instance, the process is as follows (see the sketch after this list):
1. Create a Spark session directly within the service process, with an account defined in the task.
2. Instantiate listeners by class name and register them with the SparkContext. The JARs containing the listener classes are uploaded to the service by the user.
3. Prepare resources.
4. Run the user logic (Spark SQL).
5. Stop the Spark session by invoking SparkSession.stop().
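An illustrative sketch of steps 1-5 (the listener class name is a hypothetical user-provided class; everything else is standard Spark API):

```scala
import org.apache.spark.scheduler.SparkListenerInterface
import org.apache.spark.sql.SparkSession

// Step 1: create a Spark session inside the service process.
val spark = SparkSession.builder()
  .appName("scheduled-user-task")
  .getOrCreate()

// Step 2: instantiate a user-provided listener by class name (from a JAR
// uploaded by the user) and register it with the SparkContext.
val listener = Class.forName("com.example.UserStatusListener") // hypothetical
  .getDeclaredConstructor()
  .newInstance()
  .asInstanceOf[SparkListenerInterface]
spark.sparkContext.addSparkListener(listener)

// Steps 3-4: prepare resources and run the user logic.
spark.sql("SELECT 1").collect()

// Step 5: stop the session; this is where the long wait happens today.
spark.stop()
```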
In step 5, SparkSession.stop() waits for the LiveListenerBus to stop, which requires the remaining events to be completely drained by each listener.

Since the listeners are implemented by users, we cannot prevent them from doing heavy work on each event. There are cases where a single heavy job has over 30,000 tasks, and it can take around 30 minutes for a listener to process all the remaining events, because the listener takes a coarse-grained global lock and updates its internal status in a remote database for each event. This kind of delay blocks other user tasks in the queue, so from the service side we need a guarantee that the stop operation finishes quickly.
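For a sense of why draining is slow, here is a hypothetical listener in the spirit of the description above (class name, lock, and timing are all illustrative):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class SlowUserListener extends SparkListener {
  private val globalLock = new Object

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    globalLock.synchronized {
      // Stand-in for a remote database round-trip per event. At ~60 ms per
      // event, 30,000 queued task-end events take about 30 minutes to drain.
      Thread.sleep(60)
    }
  }
}
```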
### Does this PR introduce _any_ user-facing change?
Yes. It adds the config `spark.scheduler.listenerbus.eventqueue.waitForEventDispatchExitOnStop`. The default is `true`, which preserves the current behavior of waiting for the events to drain completely. If set to `false`, the LiveListenerBus stops without waiting for the events to drain on each queue.
### How was this patch tested?
Covered by unit tests, and the feature was verified in our production environment.
### Was this patch authored or co-authored using generative AI tooling?
No