[ https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839980#comment-17839980 ]
TakawaAkirayo commented on SPARK-47952:
---------------------------------------
I'm working on it.
> Support retrieving the real SparkConnectService GRPC address and port
> programmatically when running on Yarn
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-47952
> URL: https://issues.apache.org/jira/browse/SPARK-47952
> Project: Spark
> Issue Type: Story
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: TakawaAkirayo
> Priority: Minor
>
> User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on
> Kubernetes (k8s) with limited CPU/memory resources to run spark-shell/pyspark
> in the terminal via Yarn client mode. However, Yarn client mode consumes
> significant local memory when the job is heavy, and the total k8s resource
> pool for notebooks is limited. To leverage the abundant resources of our
> Hadoop cluster for scalability, we aim to use Spark Connect: the driver runs
> on Yarn with SparkConnectService started, and a Spark Connect client connects
> to that remote driver.
> To provide a seamless experience, with one command starting both server and
> client, we've wrapped the following steps in one script:
> 1. Start a local coordinator server (implemented by us, not in this PR) on
> a specified port.
> 2. Start SparkConnectServer via spark-submit in Yarn cluster mode with the
> user-supplied Spark configurations plus the local coordinator server's
> address and port. An additional listener class is appended to the
> configuration so that SparkConnectService calls back to the coordinator
> server with its actual address and port on Yarn.
> 3. Wait for the coordinator server to receive the address callback from the
> SparkConnectService on Yarn, then export the real address.
> 4. Start the client (pyspark --remote) with the remote address.
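The four-step flow above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual implementation: the coordinator's wire protocol (a single "host:port" line over TCP) and every name here are assumptions.

```python
import socket

def run_coordinator(port, result):
    """Stand-in for the local coordinator server (step 1).

    Listens on `port` and blocks until it receives one callback line of the
    form "host:port" from the listener class running alongside the
    SparkConnectService on Yarn (step 3). The wire format is assumed for
    illustration only.
    """
    with socket.create_server(("127.0.0.1", port)) as srv:
        conn, _ = srv.accept()
        with conn:
            result["remote"] = conn.makefile().readline().strip()

# Step 2 (not shown): spark-submit --deploy-mode cluster, passing the
# coordinator's address/port and the extra listener class in the Spark conf.
# Step 4 (not shown): once result["remote"] is set, exec
#   pyspark --remote "sc://<remote address>"
```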
> Problem statement of this change:
> 1. The specified port for the SparkConnectService gRPC server might already
> be occupied on the Hadoop cluster node. To increase the startup success
> rate, the server needs to retry on bind conflicts rather than fail directly.
> 2. Because of #1 the final bound port is uncertain, and the remote address
> is unpredictable on Yarn, so we need to retrieve the address and port
> programmatically and inject them automatically when starting `pyspark
> --remote`. The SparkConnectService therefore needs to communicate its
> location back to the launcher side.
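A retry-on-conflict bind loop for problem #1 could look roughly like this. It is a sketch only: a plain TCP socket stands in for the gRPC server, and the retry count and sequential-port probing strategy are assumptions.

```python
import errno
import socket

def bind_with_retry(preferred_port, max_retries=10):
    """Try to bind `preferred_port`; on EADDRINUSE, probe successive ports.

    Returns the bound listening socket and the port actually obtained, so
    the caller can report the real port back to the coordinator (problem #2)
    instead of assuming the preferred one.
    """
    for offset in range(max_retries + 1):
        port = preferred_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("0.0.0.0", port))
            sock.listen()
            return sock, port
        except OSError as e:
            sock.close()
            if e.errno != errno.EADDRINUSE:
                raise  # some other failure; don't mask it by retrying
    raise RuntimeError(
        f"no free port in [{preferred_port}, {preferred_port + max_retries}]"
    )
```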
--
This message was sent by Atlassian Jira
(v8.20.10#820010)