Hi everyone,

I’ve been following the rapid maturation of *Spark Connect* in the 4.x
release line and have been identifying areas where remote execution has
yet to reach parity with Spark Classic.

While the remote execution model elegantly decouples the client from the
JVM, I am concerned about a performance regression in interactive and
high-complexity workloads.

Specifically, the current implementation of *Eager Analysis* (df.columns,
df.schema, etc.) relies on synchronous gRPC round-trips that block the
client thread. In environments with high network latency, these blocking
calls create a "Death by 1000 RPCs" bottleneck, often forcing developers
to write suboptimal, "Connect-specific" code just to avoid metadata
requests.
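
To make the failure mode concrete, here is a small illustration (the
connection URL and table names are hypothetical). Under Spark Connect,
each eager metadata accessor in the loop below can issue its own
blocking gRPC analysis request:

# Illustration only: each eager metadata accessor can trigger its own
# blocking analysis round-trip, so wall-clock time scales with network
# latency rather than with server-side work.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Hypothetical tables, standing in for a metadata-heavy workload.
frames = [spark.table(name) for name in ("orders", "customers", "payments")]

for df in frames:
    # Schema resolution is a synchronous round-trip to the server.
    print(df.schema)
    # Depending on client-side caching, df.columns may issue another
    # analysis request rather than reusing the schema fetched above.
    print(df.columns)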

*Proposal*:

I propose we introduce a Client-Side Metadata Skip-Layer (Lazy Prefetching)
within the Spark Connect protocol. Key pillars include:

   1. *Plan-Piggybacking:* Allow the *SparkConnectService* to return
   resolved schemas of relations during standard plan execution.

   2. *Local Schema Cache:* A configurable client-side cache in the
   *SparkSession* to store resolved schemas.

   3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol
   to allow schema resolution for multiple DataFrames in a single batch
   call (a rough sketch of pillars 2 and 3 follows below).
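
To anchor the design discussion, below is a minimal client-side sketch
of pillars 2 and 3. All names here (SchemaCache, plan_key,
batch_analyze, resolve_schemas) are hypothetical; in particular,
batch_analyze stands in for the proposed batched *AnalyzePlan*
extension and does not exist in the current protocol:

from typing import Dict, List, Tuple

class SchemaCache:
    """Configurable client-side cache for resolved schemas (pillar 2)."""

    def __init__(self, max_entries: int = 1024):
        self._entries: Dict[str, object] = {}
        self._max_entries = max_entries

    def get(self, key: str):
        return self._entries.get(key)

    def put(self, key: str, schema) -> None:
        # Simple size cap; a real implementation would likely add LRU
        # eviction and a TTL to tolerate server-side schema changes.
        if len(self._entries) < self._max_entries:
            self._entries[key] = schema

def plan_key(df) -> str:
    # Hypothetical fingerprint; a real key would hash the serialized
    # logical-plan proto so equivalent plans share one cache entry.
    return str(id(df))

def batch_analyze(frames: List) -> List[Tuple[object, object]]:
    # Placeholder for the proposed batched AnalyzePlan call (pillar 3):
    # today each df.schema access is its own RPC, whereas the proposal
    # would resolve all of these plans in a single round-trip.
    return [(df, df.schema) for df in frames]

def resolve_schemas(frames: List, cache: SchemaCache) -> List:
    """Resolve schemas for all frames, batching only the cache misses."""
    misses = [df for df in frames if cache.get(plan_key(df)) is None]
    for df, schema in batch_analyze(misses):
        cache.put(plan_key(df), schema)
    return [cache.get(plan_key(df)) for df in frames]

Keying the cache on a fingerprint of the serialized logical plan,
rather than on object identity, would let structurally identical
DataFrames share a single cache entry.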

This shift would give Spark Connect the same "fluid" interactive
experience as Spark Classic, removing the $O(N)$ network-latency
overhead of metadata-heavy operations.
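
As a purely illustrative calculation: at a 50 ms client-server round
trip, eagerly resolving metadata for 100 DataFrames costs roughly
100 × 50 ms = 5 s of network wait alone, whereas a single batched
analysis call would cost on the order of one round-trip.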

I have drafted a full SPIP document ready for review, which includes the
proposed changes for the *SparkConnectService* and *AnalyzePlan* handlers.

*SPIP Doc:*

https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing

Before I finalize the JIRA, has there been any recent internal discussion
regarding metadata prefetching or batching analysis requests in the current
Spark Connect roadmap?


Regards,
Vaquar Khan
https://www.linkedin.com/in/vaquar-khan-b695577/
