On Tue, Jun 7, 2022 at 3:54 PM Steve Loughran <ste...@cloudera.com.invalid>
wrote:

>
>
> On Fri, 3 Jun 2022 at 18:46, Martin Grund
> <martin.gr...@databricks.com.invalid> wrote:
>
>> Hi Everyone,
>>
>> We would like to start a discussion on the "Spark Connect" proposal.
>> Please find the links below:
>>
>> *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
>> *SPIP Document* -
>> https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj
>>
>> *Excerpt from the document: *
>>
>> We propose to extend Apache Spark by building on the DataFrame API and
>> the underlying unresolved logical plans. The DataFrame API is widely used
>> and makes it very easy to iteratively express complex logic. We will
>> introduce Spark Connect, a remote option of the DataFrame API that
>> separates the client from the Spark server. With Spark Connect, Spark will
>> become decoupled, allowing for built-in remote connectivity: The decoupled
>> client SDK can be used to run interactive data exploration and connect to
>> the server for DataFrame operations.
>>
>> Spark Connect will benefit Spark developers in different ways: The
>> decoupled architecture will result in improved stability, as clients are
>> separated from the driver. From the Spark Connect client perspective, Spark
>> will be (almost) versionless, and thus enable seamless upgradability, as
>> server APIs can evolve without affecting the client API. The decoupled
>> client-server architecture can be leveraged to build close integrations
>> with local developer tooling. Finally, separating the client process from
>> the Spark server process will improve Spark’s overall security posture by
>> avoiding the tight coupling of the client inside the Spark runtime
>> environment.
>>
>
> one key finding in distributed systems, going back to Nelson's original
> RPC work in 1981, is that "seamless upgradability" is usually an
> unrealised vision, especially if things like serialized java/spark
> objects are part of the payload.
>
> if it is a goal, then the tests to validate the versioning would have to
> be a key deliverable. for example: test modules built against old client
> versions.
>
> This is a particular risk with a design that proposes serialising
> logical plans; it may be hard to change planning in future.
>
> Will the protocol include something similar to the DXL plan language
> implemented in Greenplum's Orca query optimizer [1]? That's an
> under-appreciated piece of work. If the goal is for the protocol to be
> long-lived, it is a design worth considering, not just for its
> portability but because it lets people work on query optimisation as a
> service.
>
>
In the prototype I've built, I'm not actually using the fully specified
logical plans that Spark uses for query execution before optimization, but
rather something closer to the parse plans of a SQL query. The parse plans
follow the relational algebra more closely and are much less likely to
change than the actual underlying logical plan operators. The goal is not
to build an endpoint that can receive optimized plans and directly execute
them.
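
To make that distinction concrete, here is a minimal, self-contained
sketch (with hypothetical type names; this is not the actual wire format)
of what a parse-plan-level representation might carry. Attribute and
function names stay unresolved strings, and the server keeps ownership of
resolution and optimization:

    // Hypothetical types for illustration only, not the real protocol.
    sealed trait Expr
    case class UnresolvedAttr(name: String) extends Expr                // resolved by the server's analyzer
    case class UnresolvedFn(name: String, args: Seq[Expr]) extends Expr // looked up in the server's function registry
    case class Lit(value: Int) extends Expr

    sealed trait Plan
    case class UnresolvedRelation(name: String) extends Plan
    case class Filter(condition: Expr, child: Plan) extends Plan
    case class Project(exprs: Seq[Expr], child: Plan) extends Plan

    object Demo extends App {
      // Roughly: SELECT upper(name) FROM people WHERE age > 21
      val plan: Plan =
        Project(
          Seq(UnresolvedFn("upper", Seq(UnresolvedAttr("name")))),
          Filter(
            UnresolvedFn(">", Seq(UnresolvedAttr("age"), Lit(21))),
            UnresolvedRelation("people")))
      println(plan)
    }

Because the operators above correspond to plain relational algebra rather
than Spark's internal logical plan nodes, they can stay stable even as the
server's planner evolves.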

For example, all attributes in the plans are referenced as unresolved
attributes, and the same is true for functions. This delegates the
responsibility for name resolution and similar analysis to the existing
implementation, which we are not going to touch, instead of trying to
replicate it. It is still possible to provide early feedback to the user,
because one can always analyze a specific sub-plan.
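
As a sketch of that early-feedback path (again using the hypothetical
types from above, not an actual API), the client could submit any sub-plan
for analysis only and get back either a resolved schema or a resolution
error, without anything being executed:

    // Hypothetical analysis-only round trip, for illustration.
    case class AnalyzeRequest(plan: Plan)
    sealed trait AnalyzeResponse
    case class Analyzed(schema: Seq[(String, String)]) extends AnalyzeResponse // (column name, data type)
    case class AnalysisError(message: String) extends AnalyzeResponse          // e.g. unknown column or function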

Please let me know what you think.


>
> [1] Orca: A Modular Query Optimizer Architecture for Big Data
>
> https://15721.courses.cs.cmu.edu/spring2017/papers/15-optimizer2/p337-soliman.pdf
>
>
>> Spark Connect will strengthen Spark’s position as the modern unified
>> engine for large-scale data analytics and expand applicability to use cases
>> and developers we could not reach with the current setup: Spark will become
>> ubiquitously usable as the DataFrame API can be used with (almost) any
>> programming language.
>>
> That's a marketing comment, not a technical one. Best left out of ASF
> docs.
>
