[ https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839980#comment-17839980 ]
TakawaAkirayo commented on SPARK-47952:
---------------------------------------
I'm working on it.
> Support retrieving the real SparkConnectService GRPC address and port
> programmatically when running on Yarn
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-47952
> URL: https://issues.apache.org/jira/browse/SPARK-47952
> Project: Spark
> Issue Type: Story
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: TakawaAkirayo
> Priority: Minor
>
> User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on
> Kubernetes (k8s) with limited CPU/memory resources to run spark-shell/pyspark
> in the terminal via Yarn client mode. However, Yarn client mode consumes
> significant local memory when the job is heavy, and the total k8s resource
> pool for notebooks is limited. To leverage the abundant resources of our
> Hadoop cluster for scalability, we aim to use Spark Connect: the driver runs
> on Yarn with SparkConnectService started, and a Spark Connect client connects
> to that remote driver.
> To provide a seamless experience, with one command starting both server and
> client, we've wrapped the following steps in one script:
> 1. Start a local coordinator server (implemented by us, not in this PR) on
> a specified port.
> 2. Start SparkConnectServer via spark-submit in Yarn cluster mode with the
> user-supplied Spark configurations plus the local coordinator server's
> address and port. An additional listener class is appended to the
> configuration so that SparkConnectService calls back to the coordinator
> server with its actual address and port on Yarn.
> 3. Wait for the coordinator server to receive the address callback from the
> SparkConnectService on Yarn, then export the real address.
> 4. Start the client (pyspark --remote) with the remote address.
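The four-step flow above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual implementation: the coordinator's wire protocol (a single "host:port" line over TCP) and every name here are assumptions.

```python
import socket

def run_coordinator(port, result):
    """Stand-in for the local coordinator server (step 1).

    Listens on `port` and blocks until it receives one callback line of the
    form "host:port" from the listener class running alongside the
    SparkConnectService on Yarn (step 3). The wire format is assumed for
    illustration only.
    """
    with socket.create_server(("127.0.0.1", port)) as srv:
        conn, _ = srv.accept()
        with conn:
            result["remote"] = conn.makefile().readline().strip()

# Step 2 (not shown): spark-submit --deploy-mode cluster, passing the
# coordinator's address/port and the extra listener class in the Spark conf.
# Step 4 (not shown): once result["remote"] is set, exec
#   pyspark --remote "sc://<remote address>"
```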
> Problem statement of this change:
> 1. The specified port for the SparkConnectService gRPC server might already
> be occupied on the Hadoop cluster node. To increase the startup success
> rate, the server needs to retry on bind conflicts rather than fail directly.
> 2. Because of #1 the final bound port is uncertain, and the remote address
> is unpredictable on Yarn, so we need to retrieve the address and port
> programmatically and inject them automatically when starting `pyspark
> --remote`. The SparkConnectService therefore needs to communicate its
> location back to the launcher side.
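A retry-on-conflict bind loop for problem #1 could look roughly like this. It is a sketch only: a plain TCP socket stands in for the gRPC server, and the retry count and sequential-port probing strategy are assumptions.

```python
import errno
import socket

def bind_with_retry(preferred_port, max_retries=10):
    """Try to bind `preferred_port`; on EADDRINUSE, probe successive ports.

    Returns the bound listening socket and the port actually obtained, so
    the caller can report the real port back to the coordinator (problem #2)
    instead of assuming the preferred one.
    """
    for offset in range(max_retries + 1):
        port = preferred_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("0.0.0.0", port))
            sock.listen()
            return sock, port
        except OSError as e:
            sock.close()
            if e.errno != errno.EADDRINUSE:
                raise  # some other failure; don't mask it by retrying
    raise RuntimeError(
        f"no free port in [{preferred_port}, {preferred_port + max_retries}]"
    )
```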
--
This message was sent by Atlassian Jira
(v8.20.10#820010)