TakawaAkirayo opened a new pull request, #46182:
URL: https://github.com/apache/spark/pull/46182
### What changes were proposed in this pull request?
1. Add the configuration `spark.connect.grpc.port.maxRetries` (default 0, i.e.
no retries).
   [Before this PR]: The SparkConnectService failed to start whenever the
configured port was already occupied on the Yarn node.
   [After this PR]: The internal GRPC server retries successive ports until it
finds an available one or exhausts `maxRetries` (see the sketch after this
list).
2. Post SparkListenerEvents carrying the location of the remote
SparkConnectService on Yarn.
   [Before this PR]: We had to manually look up the final location (host and
port) of the SparkConnectService on Yarn before connecting with the
SparkConnect client.
   [After this PR]: The location is posted via the new events
`SparkListenerConnectServiceStarted` and `SparkListenerConnectServiceEnd`,
so users can register a listener that receives them and exposes the location,
for example by forwarding it to a coordinator server.
3. Shut down SparkPlugins before stopping the ListenerBus.
   [Before this PR]: If the SparkConnectService was launched through the
SparkConnect plugin, the plugins were shut down after the ListenerBus had
stopped, so events posted during shutdown were never delivered to listeners.
   [After this PR]: The plugins are shut down before the ListenerBus stops,
keeping the bus active during shutdown so that listeners receive the
SparkConnectService stop event.
4. Minor method refactoring in support of items 1-3.
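For illustration, the retry behavior enabled by
`spark.connect.grpc.port.maxRetries` can be sketched as below. This is a
minimal sketch, not the actual service code: a plain `ServerSocket` stands in
for the GRPC server bind, and the wrap-around port formula follows the
convention of Spark's `Utils.startServiceOnPort`.

```scala
import java.net.{BindException, ServerSocket}

object PortRetrySketch {
  // Try startPort first, then successive candidates, allowing up to
  // maxRetries extra attempts before giving up.
  def startWithRetries(startPort: Int, maxRetries: Int): ServerSocket = {
    var attempt = 0
    while (true) {
      // Wrap within the non-privileged range 1024..65535, in the same
      // style as Spark's Utils.startServiceOnPort.
      val port = (startPort + attempt - 1024) % (65536 - 1024) + 1024
      try {
        return new ServerSocket(port) // bound successfully
      } catch {
        case e: BindException =>
          if (attempt >= maxRetries) throw e // retries exhausted: fail startup
          attempt += 1 // port occupied: try the next candidate
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
```

With the default of 0 retries, the first bind conflict still fails the
startup, which preserves the behavior before this PR.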
### Why are the changes needed?
#### User story
Our data analysts and data scientists run spark-shell/pyspark from Jupyter
notebooks provisioned on Kubernetes (k8s) with limited CPU/memory, doing
interactive development from the terminal in Yarn client mode.
However, Yarn client mode consumes significant local memory when the job is
heavy, and the total k8s resource pool for notebooks is limited.
To leverage the abundant resources of our Hadoop cluster for scalability, we
want to use SparkConnect: the driver runs on Yarn with the SparkConnectService
started, and a SparkConnect client connects to that remote driver.
To provide a seamless, single-command startup for both server and client, we
wrapped the following steps in one script:
1) Start a local coordinator server (implemented internally by us, not part of
this PR) on the Jupyter notebook host.
2) Start the SparkConnectServer via spark-submit in Yarn cluster mode with the
user-supplied Spark configuration plus the local coordinator server's address
and port, and append an additional listener class to the configuration so the
SparkConnectService calls back to the coordinator with its actual address and
port on Yarn (a sketch of such a listener follows below).
3) Wait for the coordinator server to receive the address callback from the
SparkConnectService on Yarn, and export the real address.
4) Start the client (`pyspark --remote $callback_address`) with that remote
address.
The result is a remote SparkConnect server on Yarn with a local SparkConnect
client connected to it. Users no longer need to start the server beforehand,
manually discover its address on Yarn, and then connect.
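The callback listener from step 2 might look roughly like the following. The
event class and its fields (`hostName`, `port`) are placeholders assumed from
this description rather than the actual classes added by this PR, and
`notifyCoordinator` is a hypothetical stand-in for the HTTP/RPC call to the
local coordinator:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Placeholder mirroring the started event this PR introduces; the real class
// lives in the Spark Connect module and its exact fields may differ.
case class SparkListenerConnectServiceStarted(
    hostName: String,
    port: Int) extends SparkListenerEvent

// Appended to the server-side configuration so the coordinator learns the
// service's actual Yarn-side address.
class ConnectAddressCallbackListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case SparkListenerConnectServiceStarted(host, port) =>
      notifyCoordinator(host, port) // report the real address and port
    case _ => // other events are irrelevant here
  }

  // Hypothetical stand-in for the call to the coordinator server; a real
  // implementation would read the coordinator address from the Spark conf.
  private def notifyCoordinator(host: String, port: Int): Unit = {
    println(s"SparkConnectService is reachable at $host:$port")
  }
}
```

Such a listener would be appended through the standard `spark.extraListeners`
configuration when submitting the server.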
#### Problem statement of this change
1) The port specified for the SparkConnectService GRPC server might already be
occupied on the Hadoop cluster node. To increase the startup success rate, the
service needs to retry on bind conflicts rather than fail immediately.
2) Because of 1) the final bound port can differ from the configured one, and
the remote address on Yarn is unpredictable in any case, so we need to
retrieve the address and port programmatically and inject them automatically
when starting `pyspark --remote`. For that, the SparkConnectService must
communicate its location back to the launcher side.
### Does this PR introduce _any_ user-facing change?
1. Adds the configuration `spark.connect.grpc.port.maxRetries`, which enables
retrying successive ports until an available one is found or the maximum
number of retries is reached.
2. The start and stop of the SparkConnectService are observable through the
SparkListener.
   Two new events have been introduced:
   - `SparkListenerConnectServiceStarted`: the SparkConnectService (with its
address and port) is online and serving
   - `SparkListenerConnectServiceEnd`: the SparkConnectService (with its
address and port) is offline
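For illustration, a listener observing both lifecycle events could look like
the sketch below, reusing the placeholder `SparkListenerConnectServiceStarted`
from the earlier sketch; the stop event's fields are likewise assumed from
this description rather than taken from the actual code:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Assumed placeholder for the stop event, analogous to the started event
// sketched earlier; the real fields may differ.
case class SparkListenerConnectServiceEnd(
    hostName: String,
    port: Int) extends SparkListenerEvent

// Tracks whether the remote SparkConnectService is currently serving.
class ConnectServiceLifecycleListener extends SparkListener {
  @volatile private var serving = false

  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case _: SparkListenerConnectServiceStarted => serving = true  // online
    case _: SparkListenerConnectServiceEnd     => serving = false // offline
    case _ => // unrelated events
  }

  def isServing: Boolean = serving
}
```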
### How was this patch tested?
Covered by unit tests, and the feature was verified in our production
environment with an internal binary build.
### Was this patch authored or co-authored using generative AI tooling?
No