TakawaAkirayo opened a new pull request, #46182:
URL: https://github.com/apache/spark/pull/46182
### What changes were proposed in this pull request?
1. Add the configuration `spark.connect.grpc.port.maxRetries` (default 0, i.e.
no retries).
   [Before this PR]: The SparkConnectService failed to start whenever the
configured port was already occupied on the Yarn node.
   [After this PR]: The internal GRPC server retries successive ports until it
finds an available one or exhausts `maxRetries` (see the sketch after this
list).
2. Post SparkListenerEvents carrying the location of the remote
SparkConnectService on Yarn.
   [Before this PR]: We had to manually look up the final location (host and
port) of the SparkConnectService on Yarn before connecting with the
SparkConnect client.
   [After this PR]: The location is posted via the new events
`SparkListenerConnectServiceStarted` and `SparkListenerConnectServiceEnd`,
so users can register a listener that receives them and exposes the location,
for example by forwarding it to a coordinator server.
3. Shut down SparkPlugins before stopping the ListenerBus.
   [Before this PR]: If the SparkConnectService was launched through the
SparkConnect plugin, the plugins were shut down after the ListenerBus had
stopped, so events posted during shutdown were never delivered to listeners.
   [After this PR]: The plugins are shut down before the ListenerBus stops,
keeping the bus active during shutdown so that listeners receive the
SparkConnectService stop event.
4. Minor method refactoring in support of items 1-3.
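For illustration, the retry behavior enabled by
`spark.connect.grpc.port.maxRetries` can be sketched as below. This is a
minimal sketch, not the actual service code: a plain `ServerSocket` stands in
for the GRPC server bind, and the wrap-around port formula follows the
convention of Spark's `Utils.startServiceOnPort`.

```scala
import java.net.{BindException, ServerSocket}

object PortRetrySketch {
  // Try startPort first, then successive candidates, allowing up to
  // maxRetries extra attempts before giving up.
  def startWithRetries(startPort: Int, maxRetries: Int): ServerSocket = {
    var attempt = 0
    while (true) {
      // Wrap within the non-privileged range 1024..65535, in the same
      // style as Spark's Utils.startServiceOnPort.
      val port = (startPort + attempt - 1024) % (65536 - 1024) + 1024
      try {
        return new ServerSocket(port) // bound successfully
      } catch {
        case e: BindException =>
          if (attempt >= maxRetries) throw e // retries exhausted: fail startup
          attempt += 1 // port occupied: try the next candidate
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
```

With the default of 0 retries, the first bind conflict still fails the
startup, which preserves the behavior before this PR.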
### Why are the changes needed?
#### User story
Our data analysts and data scientists run spark-shell/pyspark from Jupyter
notebooks provisioned on Kubernetes (k8s) with limited CPU/memory, doing
interactive development from the terminal in Yarn client mode.
However, Yarn client mode consumes significant local memory when the job is
heavy, and the total k8s resource pool for notebooks is limited.
To leverage the abundant resources of our Hadoop cluster for scalability, we
want to use SparkConnect: the driver runs on Yarn with the SparkConnectService
started, and a SparkConnect client connects to that remote driver.
To provide a seamless, single-command startup for both server and client, we
wrapped the following steps in one script:
1) Start a local coordinator server (implemented internally by us, not part of
this PR) on the Jupyter notebook host.
2) Start the SparkConnectServer via spark-submit in Yarn cluster mode with the
user-supplied Spark configuration plus the local coordinator server's address
and port, and append an additional listener class to the configuration so the
SparkConnectService calls back to the coordinator with its actual address and
port on Yarn (a sketch of such a listener follows below).
3) Wait for the coordinator server to receive the address callback from the
SparkConnectService on Yarn, and export the real address.
4) Start the client (`pyspark --remote $callback_address`) with that remote
address.
The result is a remote SparkConnect server on Yarn with a local SparkConnect
client connected to it. Users no longer need to start the server beforehand,
manually discover its address on Yarn, and then connect.
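The callback listener from step 2 might look roughly like the following. The
event class and its fields (`hostName`, `port`) are placeholders assumed from
this description rather than the actual classes added by this PR, and
`notifyCoordinator` is a hypothetical stand-in for the HTTP/RPC call to the
local coordinator:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Placeholder mirroring the started event this PR introduces; the real class
// lives in the Spark Connect module and its exact fields may differ.
case class SparkListenerConnectServiceStarted(
    hostName: String,
    port: Int) extends SparkListenerEvent

// Appended to the server-side configuration so the coordinator learns the
// service's actual Yarn-side address.
class ConnectAddressCallbackListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case SparkListenerConnectServiceStarted(host, port) =>
      notifyCoordinator(host, port) // report the real address and port
    case _ => // other events are irrelevant here
  }

  // Hypothetical stand-in for the call to the coordinator server; a real
  // implementation would read the coordinator address from the Spark conf.
  private def notifyCoordinator(host: String, port: Int): Unit = {
    println(s"SparkConnectService is reachable at $host:$port")
  }
}
```

Such a listener would be appended through the standard `spark.extraListeners`
configuration when submitting the server.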
#### Problem statement of this change
1) The port specified for the SparkConnectService GRPC server might already be
occupied on the Hadoop cluster node. To increase the startup success rate, the
service needs to retry on bind conflicts rather than fail immediately.
2) Because of 1) the final bound port can differ from the configured one, and
the remote address on Yarn is unpredictable in any case, so we need to
retrieve the address and port programmatically and inject them automatically
when starting `pyspark --remote`. For that, the SparkConnectService must
communicate its location back to the launcher side.
### Does this PR introduce _any_ user-facing change?
1. Adds the configuration `spark.connect.grpc.port.maxRetries`, which enables
retrying successive ports until an available one is found or the maximum
number of retries is reached.
2. The start and stop of the SparkConnectService are observable through the
SparkListener.
   Two new events have been introduced:
   - `SparkListenerConnectServiceStarted`: the SparkConnectService (with its
address and port) is online and serving
   - `SparkListenerConnectServiceEnd`: the SparkConnectService (with its
address and port) is offline
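For illustration, a listener observing both lifecycle events could look like
the sketch below, reusing the placeholder `SparkListenerConnectServiceStarted`
from the earlier sketch; the stop event's fields are likewise assumed from
this description rather than taken from the actual code:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Assumed placeholder for the stop event, analogous to the started event
// sketched earlier; the real fields may differ.
case class SparkListenerConnectServiceEnd(
    hostName: String,
    port: Int) extends SparkListenerEvent

// Tracks whether the remote SparkConnectService is currently serving.
class ConnectServiceLifecycleListener extends SparkListener {
  @volatile private var serving = false

  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case _: SparkListenerConnectServiceStarted => serving = true  // online
    case _: SparkListenerConnectServiceEnd     => serving = false // offline
    case _ => // unrelated events
  }

  def isServing: Boolean = serving
}
```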
### How was this patch tested?
Covered by unit tests, and the feature was verified in our production
environment with an internal binary build.
### Was this patch authored or co-authored using generative AI tooling?
No