Hi Vaquar,

> every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.

Is this true? I think the cached schema is already propagated in operators like `filter` and `limit`, see
https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L560-L567
https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L792-L795
(A simplified sketch of this propagation pattern is appended at the end of this message.)

On Sun, Feb 8, 2026 at 4:44 AM vaquar khan <[email protected]> wrote:

> Hi Erik and Herman,
>
> Thanks for the feedback on narrowing the scope. I have updated the SPIP (SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>) to focus strictly on Phase 1: Client-Side Plan-ID Caching.
>
> I spent some time looking at the pyspark.sql.connect client code and found that while there is already a cache check in dataframe.py:1898, it is strictly instance-bound. This explains the "Death by 1000 RPCs" bottleneck we are seeing: every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.
>
> In my testing on Spark 4.0.0-preview, a sequence of 50 metadata calls on derived DataFrames took 13.2 seconds. With the proposed Plan-ID cache, that same sequence dropped to 0.25 seconds, a 51x speedup.
>
> By focusing only on this caching layer, we can solve the primary performance issue with zero protocol changes and no impact on the user-facing API. I've moved the more complex ideas, like background asynchronicity (which Erik noted as a "can of worms" regarding consistency), to a future-work section to keep this Phase 1 focused and safe.
>
> Updated SPIP:
> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>
> I would appreciate it if you could take a look at this narrowed version. Is anyone from the PMC open to shepherding this Phase 1?
>
> Regards,
> Vaquar Khan
> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> On Sun, 25 Jan 2026 at 10:53, Erik Krogen <[email protected]> wrote:
>
>> My 2c: this seems like 3 mostly unrelated proposals that should be separated out. Caching of schema information in the Spark Connect client seems uncontroversial (as long as the behavior is controllable / gated behind a flag) and, AFAICT, addresses your concerns.
>>
>> Batch resolution is interesting and I can imagine use cases, but it would require new APIs (AFAICT) and user logic changes, which doesn't seem to solve your initial problem statement of performance degradation when migrating from Classic to Connect.
>>
>> Asynchronous resolution is a big can of worms that can fundamentally change the expected behavior of the APIs.
>>
>> I think you will have more luck if you narrowly scope this proposal to just client-side caching.
>>
>> On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]> wrote:
>>
>>> Hi Herman,
>>>
>>> Sorry for the delay in getting back to you.
>>> I've finished the comprehensive benchmarking <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0> for the "Death by 1000 RPCs" bottleneck in Spark Connect and have updated the SPIP draft <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0> and JIRA SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163> ("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect") with the findings.
>>>
>>> As we've discussed, the transition to the gRPC client-server model introduced a significant latency penalty for metadata-heavy workloads. My research into a Client-Side Metadata Skip-Layer, using a deterministic Plan ID strategy, shows that we can bypass these physical network constraints. The performance gains actually ended up exceeding our initial projections.
>>>
>>> *Here are the key results from the testing (conducted on Spark 4.0.0-preview):*
>>>
>>> - Baseline Latency Confirmed: We measured a consistent 277 ms latency for a single df.columns RPC call. Our analysis shows this is split roughly between Catalyst analysis (~27%) and network RTT/serialization (~23%).
>>>
>>> - The Uncached Bottleneck: For a sequence of 50 metadata checks, which is common in complex ETL loops or frameworks like Great Expectations, the uncached architecture resulted in 13.2 seconds of blocking overhead.
>>>
>>> - Performance with Caching: With the SPARK-45123 Plan ID caching enabled, that same 50-call sequence finished in just 0.25 seconds.
>>>
>>> - Speedup: This is a *51× speedup for 50 operations*, and my projections show this scaling to a *108× speedup for 100 operations*.
>>>
>>> - RPC Elimination: By exploiting DataFrame immutability and using Plan ID invalidation for correctness, we effectively eliminated 99% of metadata RPCs in these iterative flows.
>>>
>>> This essentially solves the "Shadow Schema" problem where developers were being forced to manually track columns in local lists just to keep their notebooks responsive.
>>>
>>> Updated SPIP Draft:
>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>>
>>> Please take a look when you have a moment. If these results look solid to you, I'd like to move this toward a vote.
>>>
>>> Best regards,
>>> Vaquar Khan
>>>
>>> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>>>
>>>> Hi Herman,
>>>>
>>>> I have enabled the comments and appreciate your feedback.
>>>>
>>>> Regards,
>>>> Vaquar Khan
>>>>
>>>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <[email protected]> wrote:
>>>>
>>>>> Hi Vaquar,
>>>>>
>>>>> Can you enable comments on the doc?
>>>>>
>>>>> In general I am not against making improvements in this area. However, the devil is very much in the details here.
>>>>>
>>>>> Cheers,
>>>>> Herman
>>>>>
>>>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I've been following the rapid maturation of *Spark Connect* in the 4.x release and have been identifying areas where remote execution can reach parity with Spark Classic.
>>>>>>
>>>>>> While the remote execution model elegantly decouples the client from the JVM, I am concerned about a performance regression in interactive and high-complexity workloads.
>>>>>> Specifically, the current implementation of *Eager Analysis* (df.columns, df.schema, etc.) relies on synchronous gRPC round-trips that block the client thread. In environments with high network latency, these blocking calls create a "Death by 1000 RPCs" bottleneck, often forcing developers to write suboptimal, "Connect-specific" code to avoid metadata requests.
>>>>>>
>>>>>> *Proposal*:
>>>>>>
>>>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>>>
>>>>>> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return resolved schemas of relations during standard plan execution.
>>>>>>
>>>>>> 2. *Local Schema Cache:* A configurable client-side cache in the *SparkSession* to store resolved schemas.
>>>>>>
>>>>>> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol to allow schema resolution for multiple DataFrames in a single batch call.
>>>>>>
>>>>>> This shift would ensure that Spark Connect provides the same "fluid" interactive experience as Spark Classic, removing the O(N) network latency overhead for metadata-heavy operations.
>>>>>>
>>>>>> I have drafted a full SPIP document ready for review, which includes the proposed changes for the *SparkConnectService* and *AnalyzePlan* handlers.
>>>>>>
>>>>>> *SPIP Doc:*
>>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>>>
>>>>>> Before I finalize the JIRA, has there been any recent internal discussion regarding metadata prefetching or batching analysis requests in the current Spark Connect roadmap?
>>>>>>
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>
> --
> Regards,
> Vaquar Khan

--
Ruifeng Zheng
E-mail: [email protected]
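
For reference, the simplified sketch mentioned above. This is not the actual pyspark.sql.connect implementation; FakeConnectClient, FakeDataFrame, and analyze_schema are illustrative stand-ins. It only shows the pattern Ruifeng points to: a schema-preserving operator hands its already-resolved schema to the derived DataFrame, so repeated df.columns calls on derived frames do not pay additional AnalyzePlan round-trips.

    # Illustrative sketch only: FakeConnectClient, FakeDataFrame and analyze_schema
    # are stand-ins for the idea, not the actual pyspark.sql.connect classes.
    from dataclasses import dataclass
    from typing import List, Optional


    @dataclass
    class FakeConnectClient:
        """Stands in for the gRPC client; counts simulated AnalyzePlan round-trips."""
        analyze_calls: int = 0

        def analyze_schema(self, plan: str) -> List[str]:
            self.analyze_calls += 1          # each call models one ~277 ms blocking RPC
            return ["id", "name", "amount"]  # pretend the server resolved this schema


    @dataclass
    class FakeDataFrame:
        client: FakeConnectClient
        plan: str
        _cached_schema: Optional[List[str]] = None  # instance-bound schema cache

        @property
        def columns(self) -> List[str]:
            # Only an uncached access pays the AnalyzePlan round-trip.
            if self._cached_schema is None:
                self._cached_schema = self.client.analyze_schema(self.plan)
            return list(self._cached_schema)

        def filter(self, condition: str) -> "FakeDataFrame":
            res = FakeDataFrame(self.client, f"Filter({condition}, {self.plan})")
            # Schema-preserving operator: hand the already-resolved schema to the
            # derived frame, so it does not need a fresh RPC of its own.
            res._cached_schema = self._cached_schema
            return res

        def limit(self, n: int) -> "FakeDataFrame":
            res = FakeDataFrame(self.client, f"Limit({n}, {self.plan})")
            res._cached_schema = self._cached_schema  # same propagation as filter
            return res


    if __name__ == "__main__":
        client = FakeConnectClient()
        df = FakeDataFrame(client, "Read(table)")
        df.columns                       # first access: one simulated AnalyzePlan RPC

        for i in range(50):              # "50 metadata calls on derived DataFrames"
            df = df.filter(f"amount > {i}").limit(10)
            df.columns                   # served from the propagated cache, no new RPC

        print(f"AnalyzePlan calls: {client.analyze_calls}")  # -> 1

The narrowed Phase 1 proposal described in the thread would go a step further than this per-instance propagation: a session-level cache keyed by a deterministic plan ID would let any DataFrame whose plan has already been analyzed reuse the resolved schema, independent of how the instance was constructed.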
