Martin Grund created SPARK-39375:
------------------------------------
Summary: SPIP: Spark Connect - A client and server interface for
Apache Spark.
Key: SPARK-39375
URL: https://issues.apache.org/jira/browse/SPARK-39375
Project: Spark
Issue Type: Improvement
Components: PySpark, Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Martin Grund
Please find the full document for discussion here: [Spark Connect
SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
Below, we reproduce only the introduction.
h2. What are you trying to do?
While Spark is used extensively, it was designed nearly a decade ago, which, in
the age of serverless computing and ubiquitous programming language use, poses
a number of limitations. Most of the limitations stem from the tightly coupled
Spark driver architecture and the fact that clusters are typically shared
across users:
(1) {*}Lack of built-in remote connectivity{*}: the Spark driver runs both the
client application and scheduler, which results in a heavyweight architecture
that requires proximity to the cluster. There is no built-in capability to
remotely connect to a Spark cluster in languages other than SQL, and users
therefore rely on external solutions such as the inactive project
[Apache Livy|https://livy.apache.org/].
(2) {*}Lack of rich developer experience{*}: the current architecture and APIs
do not cater for interactive data exploration (as done with notebooks), or
allow for building out the rich developer experience common in modern code
editors.
(3) {*}Stability{*}: with the current shared driver architecture, users causing
critical exceptions (e.g. OOM) bring the whole cluster down for all users.
(4) {*}Upgradability{*}: the current entangling of platform and client APIs
(e.g. first and third-party dependencies in the classpath) does not allow for
seamless upgrades between Spark versions (and with that, hinders new feature
adoption).
We propose to overcome these challenges by building on the DataFrame API and
the underlying unresolved logical plans. The DataFrame API is widely used and
makes it very easy to iteratively express complex logic. We will introduce
{_}Spark Connect{_}, a remote option of the DataFrame API that separates the
client from the Spark server. With Spark Connect, Spark will become decoupled,
allowing for built-in remote connectivity: The decoupled client SDK can be used
to run interactive data exploration and connect to the server for DataFrame
operations.
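To make the separation concrete, here is a minimal, self-contained Python
sketch of the idea described above: the client only records an unresolved
logical plan, and a server resolves and executes it. Everything in it is an
assumption made for illustration (the names RemoteDataFrame, ToyServer and
table, the JSON plan encoding, and the in-process "transport"); it is not the
API or wire format proposed in the SPIP document.

```python
import json
import operator
from dataclasses import dataclass
from typing import Callable

# Comparison operators the toy server understands (illustrative only).
COMPARATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


@dataclass(frozen=True)
class RemoteDataFrame:
    """Hypothetical client-side handle: holds a plan, never any data."""
    plan: dict
    send: Callable[[str], str]  # stand-in for a real RPC channel

    def filter(self, column: str, cmp: str, value) -> "RemoteDataFrame":
        # Only records the operation; nothing is resolved or executed here.
        node = {"op": "filter", "column": column, "cmp": cmp,
                "value": value, "child": self.plan}
        return RemoteDataFrame(node, self.send)

    def select(self, *columns: str) -> "RemoteDataFrame":
        node = {"op": "project", "columns": list(columns), "child": self.plan}
        return RemoteDataFrame(node, self.send)

    def collect(self) -> list:
        # The serialized plan is the only thing that crosses the boundary.
        return json.loads(self.send(json.dumps(self.plan)))


def table(name: str, send: Callable[[str], str]) -> RemoteDataFrame:
    """Start a plan from a table name that the client cannot resolve itself."""
    return RemoteDataFrame({"op": "read", "table": name}, send)


class ToyServer:
    """Hypothetical server: resolves names against its catalog and executes."""

    def __init__(self, tables: dict):
        self.tables = tables

    def handle(self, request: str) -> str:
        return json.dumps(self._execute(json.loads(request)))

    def _execute(self, plan: dict) -> list:
        op = plan["op"]
        if op == "read":
            return self.tables[plan["table"]]  # name resolution happens here
        rows = self._execute(plan["child"])
        if op == "filter":
            cmp = COMPARATORS[plan["cmp"]]
            return [r for r in rows if cmp(r[plan["column"]], plan["value"])]
        if op == "project":
            return [{c: r[c] for c in plan["columns"]} for r in rows]
        raise ValueError(f"unknown plan node: {op}")


server = ToyServer({"people": [{"name": "Ann", "age": 34},
                               {"name": "Bob", "age": 19}]})
df = table("people", server.handle).filter("age", ">", 21).select("name")
result = df.collect()
```

In this sketch the client needs nothing but a JSON encoder and a way to reach
the server, which is the property that would let a thin client be written in
(almost) any language.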
Spark Connect will benefit Spark developers in different ways: The decoupled
architecture will result in improved stability, as clients are separated from
the driver. From the Spark Connect client perspective, Spark will be (almost)
versionless, and thus enable seamless upgradability, as server APIs can evolve
without affecting the client API. The decoupled client-server architecture can
be leveraged to build close integrations with local developer tooling. Finally,
separating the client process from the Spark server process will improve
Spark’s overall security posture by avoiding the tight coupling of the client
inside the Spark runtime environment.
Spark Connect will strengthen Spark’s position as the modern unified engine for
large-scale data analytics and expand applicability to use cases and developers
we could not reach with the current setup: Spark will become ubiquitously
usable as the DataFrame API can be used with (almost) any programming language.