Hi Everyone,

We would like to start a discussion on the "Spark Connect" proposal. Please
find the links below:

*JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
*SPIP Document* -
https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj

*Excerpt from the document:*

We propose to extend Apache Spark by building on the DataFrame API and the
underlying unresolved logical plans. The DataFrame API is widely used and
makes it very easy to express complex logic iteratively. We will introduce
Spark Connect, a remote variant of the DataFrame API that separates the
client from the Spark server. With Spark Connect, Spark will become
decoupled, allowing for built-in remote connectivity: the decoupled client
SDK can be used to run interactive data exploration and to connect to the
server for DataFrame operations.

Spark Connect will benefit Spark developers in several ways: The
decoupled architecture will improve stability, as clients are separated
from the driver. From the Spark Connect client's perspective, Spark will
be (almost) versionless, enabling seamless upgrades, since server APIs can
evolve without affecting the client API. The decoupled client-server
architecture can be leveraged to build close integrations with local
developer tooling. Finally, separating the client process from the Spark
server process will improve Spark’s overall security posture by avoiding
the tight coupling of the client inside the Spark runtime environment.

Spark Connect will strengthen Spark’s position as the modern unified engine
for large-scale data analytics and expand its applicability to use cases and
developers we could not reach with the current setup: Spark will become
ubiquitously usable, as the DataFrame API can be used from (almost) any
programming language.
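
To give a concrete feel for the decoupled client experience, here is a rough
sketch of what connecting a thin Python client to a remote Spark server could
look like. This is only an illustration: the remote("sc://host:port") builder
option, the connection string scheme, and the port number are assumptions for
the sake of the example and not part of the proposal text.

    # Sketch of a thin client session against a remote Spark Connect endpoint.
    # The "sc://localhost:15002" connection string is an assumption used for
    # illustration only; the actual client API is what the SPIP defines.
    from pyspark.sql import SparkSession

    # Create a session that talks to a remote server instead of starting a
    # local driver in the client process.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # DataFrame operations are built up as unresolved logical plans on the
    # client and shipped to the server for analysis and execution.
    df = spark.range(100).filter("id % 2 == 0")
    print(df.count())  # only the result travels back to the client

The point of the sketch is that the client never hosts the driver: it only
builds plans and receives results, which is what enables the stability,
upgradability, and security benefits described above.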

We would like to start the discussion on the document, and any feedback is
welcome!

Thanks a lot in advance,
Martin
