Hi Erik, Ruifeng, Herman,

Thanks for the suggestion on narrowing the scope; it helped focus the design on a stable Phase 1.

Ruifeng, I've updated the doc to clarify the distinction between existing schema propagation and the structural RPC bottleneck in schema-mutating transformations.
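To make the Phase 1 mechanism concrete, here is a rough sketch of the idea: a session-level cache keyed by a fingerprint of the unresolved plan rather than by the DataFrame instance. This is illustrative only, not code from the SPIP or the actual Connect client internals; names such as plan_fingerprint and SessionSchemaCache are placeholders.

    import hashlib

    def plan_fingerprint(plan_proto_bytes: bytes) -> str:
        # Hash the serialized unresolved plan; structurally identical plans
        # yield identical keys, even across different DataFrame instances.
        return hashlib.sha256(plan_proto_bytes).hexdigest()

    class SessionSchemaCache:
        """Session-scoped schema cache keyed by a fingerprint of the unresolved plan."""

        def __init__(self):
            self._schemas = {}  # fingerprint -> resolved schema (StructType)

        def get_or_fetch(self, plan_proto_bytes: bytes, fetch_schema):
            key = plan_fingerprint(plan_proto_bytes)
            if key not in self._schemas:
                # Cache miss: fall back to the blocking AnalyzePlan RPC.
                self._schemas[key] = fetch_schema()
            return self._schemas[key]

The point is only that the cache key is derived from the plan itself, so repeated metadata calls on DataFrames built from the same unresolved plan can reuse a single AnalyzePlan result instead of each paying the full RPC cost.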
Herman, would you be open to formally shepherding this SPIP toward a vote? I'd like to target the upcoming 4.x releases if possible.

Updated SPIP: https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0

Regards,
Vaquar Khan

On Sun, 8 Feb 2026 at 11:03, vaquar khan <[email protected]> wrote:

> Hi Ruifeng,
>
> You are correct regarding filter and limit—I verified in dataframe.py that these operators do propagate _cached_schema correctly. Thanks for flagging that. However, this investigation helped isolate the actual structural bottleneck: schema-mutating transformations.
>
> Operations like select, withColumn, and join fundamentally alter the plan structure and cannot use simple instance propagation. Currently, a loop executing df.select(...) forces a blocking 277 ms RPC on every iteration because the client treats each new DataFrame instance as a cold start.
>
> This is where the Plan-ID architecture is essential. By hashing the unresolved plan, we can detect that select("col") produces a deterministic schema, even across different DataFrame instances.
>
> I've updated the SPIP to strictly target these unoptimized schema-mutating workloads. Our SPIP is critical for interactive performance in data quality and ETL frameworks.
>
> Updated doc:
> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>
> Regards,
> Vaquar Khan
> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> On Sat, 7 Feb 2026 at 22:49, Ruifeng Zheng <[email protected]> wrote:
>
>> Hi Vaquar,
>>
>> > every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.
>>
>> Is this true? I think the cached schema is already propagated in operators like `filter` and `limit`, see
>>
>> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L560-L567
>> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L792-L795
>>
>> On Sun, Feb 8, 2026 at 4:44 AM vaquar khan <[email protected]> wrote:
>>
>>> Hi Erik and Herman,
>>>
>>> Thanks for the feedback on narrowing the scope. I have updated the SPIP (SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>) to focus strictly on Phase 1: Client-Side Plan-ID Caching.
>>>
>>> I spent some time looking at the pyspark.sql.connect client code and found that while there is already a cache check in dataframe.py:1898, it is strictly instance-bound. This explains the "Death by 1000 RPCs" bottleneck we are seeing: every time a user does something like .filter() or .limit(), it creates a new DataFrame instance with an empty cache. This forces a fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the parent.
>>>
>>> In my testing on Spark 4.0.0-preview, a sequence of 50 metadata calls on derived DataFrames took 13.2 seconds. With the proposed Plan-ID cache, that same sequence dropped to 0.25 seconds—a 51x speedup.
>>>
>>> By focusing only on this caching layer, we can solve the primary performance issue with zero protocol changes and no impact on the user-facing API.
>>> I've moved the more complex ideas like background asynchronicity—which Erik noted as a "can of worms" regarding consistency—to a future work section to keep this Phase 1 focused and safe.
>>>
>>> Updated SPIP:
>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>
>>> I would appreciate it if you could take a look at this narrowed version. Is anyone from the PMC open to shepherding this Phase 1?
>>>
>>> Regards,
>>> Vaquar Khan
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>
>>> On Sun, 25 Jan 2026 at 10:53, Erik Krogen <[email protected]> wrote:
>>>
>>>> My 2c — this seems like 3 mostly unrelated proposals that should be separated out. Caching of schema information in the Spark Connect client seems uncontroversial (as long as the behavior is controllable / gated behind a flag), and AFAICT, addresses your concerns.
>>>>
>>>> Batch resolution is interesting and I can imagine use cases, but it would require new APIs (AFAICT) and user logic changes, which doesn't seem to solve your initial problem statement of performance degradation when migrating from Classic to Connect.
>>>>
>>>> Asynchronous resolution is a big can of worms that can fundamentally change the expected behavior of the APIs.
>>>>
>>>> I think you will have more luck if you narrowly scope this proposal to just client-side caching.
>>>>
>>>> On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]> wrote:
>>>>
>>>>> Hi Herman,
>>>>>
>>>>> Sorry for the delay in getting back to you. I've finished the comprehensive benchmarking <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0> for the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have updated the SPIP draft <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0> and JIRA SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163> ("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect") with the findings.
>>>>>
>>>>> As we've discussed, the transition to the gRPC client-server model introduced a significant latency penalty for metadata-heavy workloads. My research into a Client-Side Metadata Skip-Layer, using a deterministic Plan ID strategy, shows that we can bypass these physical network constraints. The performance gains actually ended up exceeding our initial projections.
>>>>>
>>>>> *Here are the key results from the testing (conducted on Spark 4.0.0-preview):*
>>>>>
>>>>> - Baseline Latency Confirmed: We measured a consistent 277 ms latency for a single df.columns RPC call. Our analysis shows this is split roughly between Catalyst analysis (~27%) and network RTT/serialization (~23%).
>>>>>
>>>>> - The Uncached Bottleneck: For a sequence of 50 metadata checks—which is common in complex ETL loops or frameworks like Great Expectations—the uncached architecture resulted in 13.2 seconds of blocking overhead.
>>>>>
>>>>> - Performance with Caching: With the SPARK-45123 Plan ID caching enabled, that same 50-call sequence finished in just 0.25 seconds.
>>>>>
>>>>> - Speedup: This is a *51× speedup for 50 operations*, and my projections show this scaling to a *108× speedup for 100 operations*.
>>>>> - RPC Elimination: By exploiting DataFrame immutability and using Plan ID invalidation for correctness, we effectively eliminated 99% of metadata RPCs in these iterative flows.
>>>>>
>>>>> This essentially solves the "Shadow Schema" problem where developers were being forced to manually track columns in local lists just to keep their notebooks responsive.
>>>>>
>>>>> Updated SPIP Draft:
>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>>>>
>>>>> Please take a look when you have a moment. If these results look solid to you, I'd like to move this toward a vote.
>>>>>
>>>>> Best regards,
>>>>> Vaquar Khan
>>>>>
>>>>> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi Herman,
>>>>>>
>>>>>> I have enabled the comments and appreciate your feedback.
>>>>>>
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>>>
>>>>>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Vaquar,
>>>>>>>
>>>>>>> Can you enable comments on the doc?
>>>>>>>
>>>>>>> In general I am not against making improvements in this area. However, the devil is very much in the details here.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Herman
>>>>>>>
>>>>>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I've been following the rapid maturation of *Spark Connect* in the 4.x release and have been identifying areas where remote execution can reach parity with Spark Classic.
>>>>>>>>
>>>>>>>> While the remote execution model elegantly decouples the client from the JVM, I am concerned about a performance regression in interactive and high-complexity workloads.
>>>>>>>>
>>>>>>>> Specifically, the current implementation of *Eager Analysis* (df.columns, df.schema, etc.) relies on synchronous gRPC round-trips that block the client thread. In environments with high network latency, these blocking calls create a "Death by 1000 RPCs" bottleneck—often forcing developers to write suboptimal, "Connect-specific" code to avoid metadata requests.
>>>>>>>>
>>>>>>>> *Proposal*:
>>>>>>>>
>>>>>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>>>>>
>>>>>>>> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return resolved schemas of relations during standard plan execution.
>>>>>>>> 2. *Local Schema Cache:* A configurable client-side cache in the *SparkSession* to store resolved schemas.
>>>>>>>> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol to allow schema resolution for multiple DataFrames in a single batch call.
>>>>>>>>
>>>>>>>> This shift would ensure that Spark Connect provides the same "fluid" interactive experience as Spark Classic, removing the $O(N)$ network latency overhead for metadata-heavy operations.
>>>>>>>>
>>>>>>>> I have drafted a full SPIP document ready for review, which includes the proposed changes for the *SparkConnectService* and *AnalyzePlan* handlers.
>>>>>>>> *SPIP Doc:*
>>>>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>>>>>
>>>>>>>> Before I finalize the JIRA, has there been any recent internal discussion regarding metadata prefetching or batching analysis requests in the current Spark Connect roadmap?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vaquar Khan
>>>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Vaquar Khan
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>
>> --
>> Ruifeng Zheng
>> E-mail: [email protected]
>
> --
> Regards,
> Vaquar Khan
