Hi Vaquar,

Can you enable comments on the doc?
In general I am not against making improvements in this area. However, the devil is very much in the details here.

Cheers,
Herman

On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]> wrote:

> Hi everyone,
>
> I’ve been following the rapid maturation of *Spark Connect* in the 4.x
> release and have been identifying areas where remote execution can reach
> parity with Spark Classic.
>
> While the remote execution model elegantly decouples the client from the
> JVM, I am concerned about a performance regression in interactive and
> high-complexity workloads.
>
> Specifically, the current implementation of *Eager Analysis* (df.columns,
> df.schema, etc.) relies on synchronous gRPC round-trips that block the
> client thread. In environments with high network latency, these blocking
> calls create a "Death by 1000 RPCs" bottleneck, often forcing developers to
> write suboptimal, "Connect-specific" code to avoid metadata requests.
>
> *Proposal*:
>
> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy
> Prefetching) within the Spark Connect protocol. Key pillars include:
>
> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return
>    resolved schemas of relations during standard plan execution.
> 2. *Local Schema Cache:* A configurable client-side cache in the
>    *SparkSession* to store resolved schemas.
> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol to
>    allow schema resolution for multiple DataFrames in a single batch call.
>
> This shift would ensure that Spark Connect provides the same "fluid"
> interactive experience as Spark Classic, removing the O(N) network
> latency overhead for metadata-heavy operations.
>
> I have drafted a full SPIP document ready for review, which includes the
> proposed changes for the *SparkConnectService* and *AnalyzePlan* handlers.
>
> *SPIP Doc:*
> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>
> Before I finalize the JIRA, has there been any recent internal discussion
> regarding metadata prefetching or batching analysis requests in the current
> Spark Connect roadmap?
>
> Regards,
> Vaquar Khan
> https://www.linkedin.com/in/vaquar-khan-b695577/
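
To make the second and third pillars of the quoted proposal concrete (a local schema cache plus a batched analysis call), here is a rough sketch in plain Python. It is an illustration only and does not use any Spark Connect API: `SchemaCache`, `resolve_batch`, `fake_server`, and the string plan ids are all hypothetical names, and a "schema" is simplified to a list of column names. The point is just that N metadata lookups can collapse into one round trip when the client batches the misses and caches the results.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Sequence

Schema = List[str]  # simplified stand-in for a resolved schema: column names only


@dataclass
class SchemaCache:
    """Hypothetical client-side cache in front of a batched schema resolver."""

    # Assumed batched call: takes plan ids, returns their schemas in ONE round trip.
    resolve_batch: Callable[[Sequence[str]], Dict[str, Schema]]
    round_trips: int = 0
    _schemas: Dict[str, Schema] = field(default_factory=dict, init=False)

    def prefetch(self, plan_ids: Iterable[str]) -> None:
        # De-duplicate while keeping order, then request only the cache misses.
        missing = [p for p in dict.fromkeys(plan_ids) if p not in self._schemas]
        if missing:
            self._schemas.update(self.resolve_batch(missing))
            self.round_trips += 1

    def schema(self, plan_id: str) -> Schema:
        # A miss falls back to a single-item batch; a hit costs no round trip.
        if plan_id not in self._schemas:
            self.prefetch([plan_id])
        return self._schemas[plan_id]


def fake_server(plan_ids: Sequence[str]) -> Dict[str, Schema]:
    """Stand-in for the remote side; a real server would analyze each plan."""
    return {p: [f"{p}_c0", f"{p}_c1"] for p in plan_ids}


if __name__ == "__main__":
    cache = SchemaCache(fake_server)
    cache.prefetch(["orders", "users", "orders"])  # one batched round trip
    print(cache.schema("orders"), cache.schema("users"), cache.round_trips)
```

The sketch deliberately leaves out the hard parts Herman alludes to, such as cache invalidation when the underlying tables or session configuration change, and how a plan would be fingerprinted to form a cache key.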
