TakawaAkirayo created SPARK-47951:
-------------------------------------
Summary: Support retrieving the real SparkConnectService GRPC
address and port programmatically when running on Yarn
Key: SPARK-47951
URL: https://issues.apache.org/jira/browse/SPARK-47951
Project: Spark
Issue Type: Story
Components: Connect
Affects Versions: 4.0.0
Reporter: TakawaAkirayo
1. User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on
Kubernetes (k8s) with limited CPU/memory resources to run spark-shell/pyspark
in the terminal via Yarn Client mode.
However, in Yarn Client mode the driver runs locally and consumes significant
local memory when the job is heavy, and the total k8s resource pool for
notebooks is limited.
To leverage the abundant resources of our Hadoop cluster for scalability
purposes, we aim to utilize SparkConnect.
This lets the driver run on Yarn with SparkConnectService started, while a
SparkConnect client in the notebook connects to the remote driver.
To provide a seamless experience with one command startup for both server and
client, we've wrapped the following processes in one script:
1). Start a local coordinator server (implemented by us internally, not in this
PR) on the Jupyter notebook host.
2). Start SparkConnectServer by spark-submit via Yarn Cluster mode with
user-input Spark configurations and the local coordinator server's address and
port.
Append an additional listener class to the configuration so that
SparkConnectService calls back to the coordinator server with its actual
address and port on Yarn.
3). Wait for the coordinator server to receive the address callback from the
SparkConnectService on Yarn and export the real address.
4). Start the client (pyspark --remote $callback_address) with the remote
address.
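The launcher side of steps 1), 3) and 4) could look roughly like the sketch
below. It is only an illustration, not our internal coordinator: the function
names, the plain TCP transport, and the one-line `host:port` callback message
are all assumptions made for the example. The `sc://` prefix is the usual
Spark Connect remote URL scheme; whether the callback address already carries
it depends on the listener implementation.

```python
import socket


def start_coordinator(host="127.0.0.1"):
    """Listen on an OS-assigned port; return (server_socket, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0 asks the OS for any free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def wait_for_callback(srv, timeout=60.0):
    """Block until one callback arrives; return the reported 'host:port'."""
    srv.settimeout(timeout)
    conn, _ = srv.accept()
    with conn:
        conn.settimeout(timeout)
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:  # sender closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").strip()


def client_command(callback_address):
    """Build the client command line (assumed form: pyspark --remote sc://...)."""
    return ["pyspark", "--remote", f"sc://{callback_address}"]
```

The coordinator's own host and port would be passed to spark-submit in step 2)
so that the listener on Yarn knows where to call back.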
2. Problem statement of this change:
1). The specified port for the SparkConnectService GRPC server might be
occupied on the node of the Hadoop Cluster.
To increase the startup success rate, the server should retry on port
conflicts rather than fail immediately.
2). Because the port that is finally bound is not known in advance once
retries from #1 are in play, and the remote host is unpredictable on Yarn,
we need to retrieve the address and port programmatically and inject them
automatically when starting `pyspark --remote`.
To get the address of SparkConnectService on Yarn programmatically, the
SparkConnectService needs to communicate its location back to the launcher side.
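The two requirements on the server side can be illustrated with a small
sketch: try successive ports when the requested one is taken, then report the
port that was actually bound. This is not the proposed implementation in
SparkConnectService; `bind_with_retry`, `report_address`, the "next port up"
retry policy, and the `host:port` message over plain TCP are hypothetical and
used only to show the idea.

```python
import errno
import socket


def bind_with_retry(start_port, max_retries=16, host="0.0.0.0"):
    """Bind to start_port, or the next free port within max_retries attempts."""
    last_error = None
    for offset in range(max_retries + 1):
        port = start_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # only port conflicts are retried
            last_error = exc
            continue
        sock.listen()
        return sock, port
    raise OSError(
        errno.EADDRINUSE,
        f"no free port in {start_port}..{start_port + max_retries}",
    ) from last_error


def report_address(coordinator_host, coordinator_port, bound_host, bound_port,
                   timeout=10.0):
    """Send the actual 'host:port' back to the launcher-side coordinator."""
    message = f"{bound_host}:{bound_port}\n".encode("utf-8")
    with socket.create_connection(
        (coordinator_host, coordinator_port), timeout=timeout
    ) as conn:
        conn.sendall(message)
```

Because the reported port comes from the successful bind rather than from the
configuration, the launcher always receives the address the client can
actually connect to.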