Hi Herman,

I have enabled the comments and appreciate your feedback.

Regards,
Vaquar khan

On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <[email protected]> wrote:

> Hi Vaquar,
>
> Can you enable comments on the doc?
>
> In general I am not against making improvements in this area. However the
> devil is very much in the details here.
>
> Cheers,
> Herman
>
> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I’ve been following the rapid maturation of *Spark Connect* in the 4.x
>> release and have been identifying areas where remote execution can reach
>> parity with Spark Classic.
>>
>> While the remote execution model elegantly decouples the client from the
>> JVM, I am concerned about a performance regression in interactive and
>> high-complexity workloads.
>>
>> Specifically, the current implementation of *Eager Analysis* (df.columns,
>> df.schema, etc.) relies on synchronous gRPC round-trips that block the
>> client thread. In environments with high network latency, these blocking
>> calls create a "Death by 1000 RPCs" bottleneck, often forcing developers to
>> write suboptimal, "Connect-specific" code to avoid metadata requests.
>>
>> *Proposal*:
>>
>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy
>> Prefetching) within the Spark Connect protocol. Key pillars include:
>>
>> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return
>>    resolved schemas of relations during standard plan execution.
>> 2. *Local Schema Cache:* A configurable client-side cache in the
>>    *SparkSession* to store resolved schemas.
>> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol to
>>    allow schema resolution for multiple DataFrames in a single batch call.
>>
>> This shift would ensure that Spark Connect provides the same "fluid"
>> interactive experience as Spark Classic, removing the $O(N)$ network
>> latency overhead for metadata-heavy operations.
>>
>> I have drafted a full SPIP document ready for review, which includes
>> the proposed changes for the *SparkConnectService* and *AnalyzePlan*
>> handlers.
>>
>> *SPIP Doc:*
>>
>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>
>> Before I finalize the JIRA, has there been any recent internal discussion
>> regarding metadata prefetching or batching analysis requests in the current
>> Spark Connect roadmap?
>>
>> Regards,
>> Vaquar Khan
>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>
>

--
Regards,
Vaquar Khan
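
For readers of the thread, here is a rough sketch of how the Local Schema Cache and Batched Analysis API pillars quoted above might fit together. It is only an illustration under stated assumptions: it does not use any real Spark Connect API, and every name in it (FakeAnalyzeService, SchemaCache, analyze_batch, prefetch, the string plan ids) is hypothetical. The remote analyzer is simulated in-process so that round-trips can be counted.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Stand-in for a resolved schema: just a tuple of column names.
Schema = Tuple[str, ...]


@dataclass
class FakeAnalyzeService:
    """Hypothetical stand-in for the remote analyzer; counts round-trips."""
    schemas: Dict[str, Schema]
    round_trips: int = 0

    def analyze_batch(self, plan_ids: List[str]) -> Dict[str, Schema]:
        # One simulated RPC resolves every plan in the request.
        self.round_trips += 1
        return {pid: self.schemas[pid] for pid in plan_ids}


@dataclass
class SchemaCache:
    """Hypothetical client-side cache keyed by a plan identifier."""
    service: FakeAnalyzeService
    entries: Dict[str, Schema] = field(default_factory=dict)

    def prefetch(self, plan_ids: Iterable[str]) -> None:
        # De-duplicate, then ask the server only for unresolved plans.
        missing = [p for p in dict.fromkeys(plan_ids) if p not in self.entries]
        if missing:
            self.entries.update(self.service.analyze_batch(missing))

    def schema(self, plan_id: str) -> Schema:
        # Falls back to a single-plan request on a cache miss.
        self.prefetch([plan_id])
        return self.entries[plan_id]

    def invalidate(self, plan_id: str) -> None:
        # Drop an entry, e.g. when the underlying plan is known to be stale.
        self.entries.pop(plan_id, None)


service = FakeAnalyzeService(
    {"orders": ("id", "amount"), "users": ("id", "name"), "items": ("sku",)}
)
cache = SchemaCache(service)

# Three schemas resolved with one simulated round-trip instead of three.
cache.prefetch(["orders", "users", "items"])
cols = [cache.schema(p) for p in ["orders", "users", "items", "orders"]]
print(service.round_trips)
```

With the batch prefetch in place, the later per-plan lookups are served from the cache, so the number of simulated round-trips no longer grows with the number of metadata calls. When and how entries should be invalidated is left open here, since that is one of the details the SPIP would need to settle.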
