Hi Erik, Ruifeng, Herman,

Thanks for the suggestion on narrowing the scope; it helped focus the design on a stable Phase 1.

Ruifeng, I've updated the doc to clarify the distinction between existing schema propagation and the structural RPC bottleneck in schema-mutating transformations.
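To make the Phase 1 mechanism concrete, here is a rough sketch of the idea: a session-level cache keyed by a fingerprint of the unresolved plan rather than by the DataFrame instance. This is illustrative only, not code from the SPIP or the actual Connect client internals; names such as plan_fingerprint and SessionSchemaCache are placeholders.

    import hashlib

    def plan_fingerprint(plan_proto_bytes: bytes) -> str:
        # Hash the serialized unresolved plan; structurally identical plans
        # yield identical keys, even across different DataFrame instances.
        return hashlib.sha256(plan_proto_bytes).hexdigest()

    class SessionSchemaCache:
        """Session-scoped schema cache keyed by a fingerprint of the unresolved plan."""

        def __init__(self):
            self._schemas = {}  # fingerprint -> resolved schema (StructType)

        def get_or_fetch(self, plan_proto_bytes: bytes, fetch_schema):
            key = plan_fingerprint(plan_proto_bytes)
            if key not in self._schemas:
                # Cache miss: fall back to the blocking AnalyzePlan RPC.
                self._schemas[key] = fetch_schema()
            return self._schemas[key]

The point is only that the cache key is derived from the plan itself, so repeated metadata calls on DataFrames built from the same unresolved plan can reuse a single AnalyzePlan result instead of each paying the full RPC cost.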
Herman, would you be open to formally shepherding this SPIP toward a vote? I'd like to target the upcoming 4.x releases if possible.

Updated SPIP: https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0

Regards,
Vaquar Khan

On Sun, 8 Feb 2026 at 11:03, vaquar khan <[email protected]> wrote:

> Hi Ruifeng,
>
> You are correct regarding filter and limit—I verified in dataframe.py that these operators do propagate _cached_schema correctly. Thanks for flagging that. However, this investigation helped isolate the actual structural bottleneck: schema-mutating transformations.
>
> Operations like select, withColumn, and join fundamentally alter the plan structure and cannot use simple instance propagation. Currently, a loop executing df.select(...) forces a blocking 277 ms RPC on every iteration because the client treats each new DataFrame instance as a cold start.
>
> This is where the Plan-ID architecture is essential. By hashing the unresolved plan, we can detect that select("col") produces a deterministic schema, even across different DataFrame instances.
>
> I've updated the SPIP to strictly target these unoptimized schema-mutating workloads. Our SPIP is critical for interactive performance in data quality and ETL frameworks.
>
> Updated doc:
> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>
> Regards,
> Vaquar Khan
> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> On Sat, 7 Feb 2026 at 22:49, Ruifeng Zheng <[email protected]> wrote:
>
>> Hi Vaquar,
>>
>> > every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.
>>
>> Is this true? I think the cached schema is already propagated in operators like `filter` and `limit`, see
>>
>> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L560-L567
>> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L792-L795
>>
>> On Sun, Feb 8, 2026 at 4:44 AM vaquar khan <[email protected]> wrote:
>>
>>> Hi Erik and Herman,
>>>
>>> Thanks for the feedback on narrowing the scope. I have updated the SPIP (SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>) to focus strictly on Phase 1: Client-Side Plan-ID Caching.
>>>
>>> I spent some time looking at the pyspark.sql.connect client code and found that while there is already a cache check in dataframe.py:1898, it is strictly instance-bound. This explains the "Death by 1000 RPCs" bottleneck we are seeing: every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.
>>>
>>> In my testing on Spark 4.0.0-preview, a sequence of 50 metadata calls on derived DataFrames took 13.2 seconds. With the proposed Plan-ID cache, that same sequence dropped to 0.25 seconds—a 51x speedup.
>>>
>>> By focusing only on this caching layer, we can solve the primary performance issue with zero protocol changes and no impact on the user-facing API.
>>> I've moved the more complex ideas like background asynchronicity—which Erik noted as a "can of worms" regarding consistency—to a future work section to keep this Phase 1 focused and safe.
>>>
>>> Updated SPIP:
>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>
>>> I would appreciate it if you could take a look at this narrowed version. Is anyone from the PMC open to shepherding this Phase 1?
>>>
>>> Regards,
>>> Vaquar Khan
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>
>>> On Sun, 25 Jan 2026 at 10:53, Erik Krogen <[email protected]> wrote:
>>>
>>>> My 2c — this seems like 3 mostly unrelated proposals that should be separated out. Caching of schema information in the Spark Connect client seems uncontroversial (as long as the behavior is controllable / gated behind a flag), and AFAICT, addresses your concerns.
>>>>
>>>> Batch resolution is interesting and I can imagine use cases, but it would require new APIs (AFAICT) and user logic changes, which doesn't seem to solve your initial problem statement of performance degradation when migrating from Classic to Connect.
>>>>
>>>> Asynchronous resolution is a big can of worms that can fundamentally change the expected behavior of the APIs.
>>>>
>>>> I think you will have more luck if you narrowly scope this proposal to just client-side caching.
>>>>
>>>> On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]> wrote:
>>>>
>>>>> Hi Herman,
>>>>>
>>>>> Sorry for the delay in getting back to you. I've finished the comprehensive benchmarking <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0> for the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have updated the SPIP draft <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0> and JIRA SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163> ("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect") with the findings.
>>>>>
>>>>> As we've discussed, the transition to the gRPC client-server model introduced a significant latency penalty for metadata-heavy workloads. My research into a Client-Side Metadata Skip-Layer, using a deterministic Plan ID strategy, shows that we can bypass these physical network constraints. The performance gains actually ended up exceeding our initial projections.
>>>>>
>>>>> *Here are the key results from the testing (conducted on Spark 4.0.0-preview):*
>>>>>
>>>>> - Baseline Latency Confirmed: We measured a consistent 277 ms latency for a single df.columns RPC call. Our analysis shows this is split roughly between Catalyst analysis (~27%) and network RTT/serialization (~23%).
>>>>>
>>>>> - The Uncached Bottleneck: For a sequence of 50 metadata checks—which is common in complex ETL loops or frameworks like Great Expectations—the uncached architecture resulted in 13.2 seconds of blocking overhead.
>>>>>
>>>>> - Performance with Caching: With the SPARK-45123 Plan ID caching enabled, that same 50-call sequence finished in just 0.25 seconds.
>>>>>
>>>>> - Speedup: This is a *51× speedup for 50 operations*, and my projections show this scaling to a *108× speedup for 100 operations*.
>>>>> - RPC Elimination: By exploiting DataFrame immutability and using Plan ID invalidation for correctness, we effectively eliminated 99% of metadata RPCs in these iterative flows.
>>>>>
>>>>> This essentially solves the "Shadow Schema" problem where developers were being forced to manually track columns in local lists just to keep their notebooks responsive.
>>>>>
>>>>> Updated SPIP Draft:
>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>>>>
>>>>> Please take a look when you have a moment. If these results look solid to you, I'd like to move this toward a vote.
>>>>>
>>>>> Best regards,
>>>>> Vaquar Khan
>>>>>
>>>>> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi Herman,
>>>>>>
>>>>>> I have enabled the comments and appreciate your feedback.
>>>>>>
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>>>
>>>>>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Vaquar,
>>>>>>>
>>>>>>> Can you enable comments on the doc?
>>>>>>>
>>>>>>> In general I am not against making improvements in this area. However, the devil is very much in the details here.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Herman
>>>>>>>
>>>>>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I've been following the rapid maturation of *Spark Connect* in the 4.x release and have been identifying areas where remote execution can reach parity with Spark Classic.
>>>>>>>>
>>>>>>>> While the remote execution model elegantly decouples the client from the JVM, I am concerned about a performance regression in interactive and high-complexity workloads.
>>>>>>>>
>>>>>>>> Specifically, the current implementation of *Eager Analysis* (df.columns, df.schema, etc.) relies on synchronous gRPC round-trips that block the client thread. In environments with high network latency, these blocking calls create a "Death by 1000 RPCs" bottleneck—often forcing developers to write suboptimal, "Connect-specific" code to avoid metadata requests.
>>>>>>>>
>>>>>>>> *Proposal*:
>>>>>>>>
>>>>>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>>>>>
>>>>>>>> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return resolved schemas of relations during standard plan execution.
>>>>>>>> 2. *Local Schema Cache:* A configurable client-side cache in the *SparkSession* to store resolved schemas.
>>>>>>>> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol to allow schema resolution for multiple DataFrames in a single batch call.
>>>>>>>>
>>>>>>>> This shift would ensure that Spark Connect provides the same "fluid" interactive experience as Spark Classic, removing the $O(N)$ network latency overhead for metadata-heavy operations.
>>>>>>>>
>>>>>>>> I have drafted a full SPIP document ready for review, which includes the proposed changes for the *SparkConnectService* and *AnalyzePlan* handlers.
>>>>>>>> *SPIP Doc:*
>>>>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>>>>>
>>>>>>>> Before I finalize the JIRA, has there been any recent internal discussion regarding metadata prefetching or batching analysis requests in the current Spark Connect roadmap?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vaquar Khan
>>>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Vaquar Khan
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>
>> --
>> Ruifeng Zheng
>> E-mail: [email protected]
>
> --
> Regards,
> Vaquar Khan
