Hi Ruifeng,
You are correct regarding filter and limit: I verified in dataframe.py that
these operators do propagate _cached_schema correctly. Thanks for flagging
that.

However, this investigation helped isolate the actual structural
bottleneck: schema-mutating transformations.
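
For anyone following along, here is a toy illustration of what that
instance-level propagation means (this is not the actual
pyspark.sql.connect code, just the shape of the idea):

class ToyFrame:
    def __init__(self, plan, cached_schema=None):
        self._plan = plan
        self._cached_schema = cached_schema  # None means "not resolved yet"

    def filter(self, condition):
        # filter/limit cannot change the schema, so the child simply reuses
        # the parent's already-resolved schema and needs no new RPC
        return ToyFrame(("Filter", condition, self._plan), self._cached_schema)

    def select(self, *cols):
        # a projection produces a different schema, so there is nothing
        # safe to copy; today this is where the cold start happens
        return ToyFrame(("Project", cols, self._plan), cached_schema=None)
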
Operations like select, withColumn, and join fundamentally alter the plan
structure and cannot use simple instance propagation. Currently, a loop
that calls df.select(...) and then touches the schema pays a blocking
277 ms AnalyzePlan RPC on every iteration, because the client treats each
new DataFrame instance as a cold start.
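
For concreteness, this is the shape of the loop I mean (a minimal sketch,
assuming a local Spark Connect endpoint on the default port):

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.range(100)

for _ in range(50):
    derived = df.select("id")  # new DataFrame instance, empty schema cache
    _ = derived.schema         # blocking AnalyzePlan round-trip each time
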
This is where the Plan-ID architecture is essential. By hashing the
unresolved plan, we can recognize that two DataFrame instances built from
the same select("col") share the same deterministic schema, and resolve it
only once.
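
To make that concrete, here is a rough sketch of the caching layer I am
proposing (the class and method names are illustrative only, not the
actual client internals): key a session-level schema cache on a stable
fingerprint of the serialized unresolved plan, so equivalent plans pay for
analysis only once.

import hashlib

class PlanSchemaCache:
    """Hypothetical session-level cache keyed by a plan fingerprint."""

    def __init__(self):
        self._schemas = {}  # fingerprint -> resolved schema

    @staticmethod
    def fingerprint(plan_proto_bytes: bytes) -> str:
        # Deterministic hash of the serialized unresolved plan, so
        # df.select("col") built twice maps to the same cache entry.
        return hashlib.sha256(plan_proto_bytes).hexdigest()

    def get_or_fetch(self, plan_proto_bytes, fetch_schema):
        key = self.fingerprint(plan_proto_bytes)
        if key not in self._schemas:
            # Only the first occurrence of this plan shape pays the
            # AnalyzePlan round-trip; later instances reuse the result.
            self._schemas[key] = fetch_schema()
        return self._schemas[key]

Because DataFrames are immutable, a given fingerprint only ever maps to one
plan shape; any change to the plan changes the key, so the cache never
serves a schema for a different plan.
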
I’ve updated the SPIP to focus strictly on these unoptimized schema-mutating
workloads. This SPIP is critical for interactive performance in data quality
and ETL frameworks.

Updated doc:
https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
Regards,
Vaquar Khan
https://www.linkedin.com/in/vaquar-khan-b695577/
On Sat, 7 Feb 2026 at 22:49, Ruifeng Zheng <[email protected]> wrote:
> Hi Vaquar,
>
>
> > every time a user does something like .filter() or .limit(), it creates a
> new DataFrame instance with an empty cache. This forces a fresh 277 ms
> AnalyzePlan RPC even if the schema is exactly the same as the parent.
>
> Is this true? I think the cached schema is already propagated in operators
> like `filter` and `limit`, see
>
>
> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L560-L567
>
> https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L792-L795
>
>
> On Sun, Feb 8, 2026 at 4:44 AM vaquar khan <[email protected]> wrote:
>
>> Hi Erik and Herman,
>>
>> Thanks for the feedback on narrowing the scope. I have updated the SPIP (
>> SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>) to
>> focus strictly on Phase 1: Client-Side Plan-ID Caching.
>>
>> I spent some time looking at the pyspark.sql.connect client code and
>> found that while there is already a cache check in dataframe.py:1898, it is
>> strictly instance-bound. This explains the "Death by 1000 RPCs" bottleneck
>> we are seeing: every time a user does something like .filter() or .limit(),
>> it creates a new DataFrame instance with an empty cache. This forces a
>> fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the
>> parent.
>>
>> In my testing on Spark 4.0.0-preview, a sequence of 50 metadata calls on
>> derived DataFrames took 13.2 seconds. With the proposed Plan-ID cache, that
>> same sequence dropped to 0.25 seconds—a 51x speedup.
>>
>> By focusing only on this caching layer, we can solve the primary
>> performance issue with zero protocol changes and no impact on the
>> user-facing API. I've moved the more complex ideas like background
>> asynchronicity—which Erik noted as a "can of worms" regarding
>> consistency—to a future work section to keep this Phase 1 focused and safe.
>>
>> Updated SPIP:
>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>
>> I would appreciate it if you could take a look at this narrowed version.
>> Is anyone from the PMC open to shepherding this Phase 1?
>>
>> Regards,
>> Vaquar Khan
>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>
>> On Sun, 25 Jan 2026 at 10:53, Erik Krogen <[email protected]> wrote:
>>
>>> My 2c — this seems like 3 mostly unrelated proposals that should be
>>> separated out. Caching of schema information in the Spark Connect client
>>> seems uncontroversial (as long as the behavior is controllable / gated
>>> behind a flag), and AFAICT, addresses your concerns.
>>>
>>> Batch resolution is interesting and I can imagine use cases, but it
>>> would require new APIs (AFAICT) and user logic changes, which doesn’t seem
>>> to solve your initial problem statement of performance degradation when
>>> migrating from Classic to Connect.
>>>
>>> Asynchronous resolution is a big can of worms that can fundamentally
>>> change the expected behavior of the APIs.
>>>
>>> I think you will have more luck if you narrowly scope this proposal to
>>> just client-side caching.
>>>
>>> On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]>
>>> wrote:
>>>
>>>> Hi Herman,
>>>>
>>>> Sorry for the delay in getting back to you. I’ve finished the
>>>> comprehensive benchmarking
>>>> <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0> for
>>>> the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have
>>>> updated the SPIP draft
>>>> <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0>
>>>> and JIRA SPARK-55163
>>>> <https://issues.apache.org/jira/browse/SPARK-55163> ("Asynchronous
>>>> Metadata Resolution & Lazy Prefetching for Spark Connect") with the
>>>> findings.
>>>>
>>>> As we’ve discussed, the transition to the gRPC client-server model
>>>> introduced a significant latency penalty for metadata-heavy workloads. My
>>>> research into a Client-Side Metadata Skip-Layer, using a deterministic Plan
>>>> ID strategy, shows that we can bypass these physical network constraints.
>>>> The performance gains actually ended up exceeding our initial projections.
>>>>
>>>>
>>>> *Here are the key results from the testing (conducted on Spark
>>>> 4.0.0-preview):*
>>>> - Baseline Latency Confirmed: We measured a consistent 277 ms
>>>> latency for a single df.columns RPC call. Our analysis shows this is split
>>>> roughly between Catalyst analysis (~27%) and network RTT/serialization
>>>> (~23%).
>>>>
>>>> - The Uncached Bottleneck: For a sequence of 50 metadata checks—which
>>>> is common in complex ETL loops or frameworks like Great Expectations—the
>>>> uncached architecture resulted in 13.2 seconds of blocking overhead.
>>>>
>>>> - Performance with Caching: With the SPARK-45123 Plan ID caching
>>>> enabled, that same 50-call sequence finished in just 0.25 seconds.
>>>>
>>>> - Speedup: This is a *51× speedup for 50 operations*, and my
>>>> projections show this scaling to a *108× speedup for 100 operations*.
>>>>
>>>> - RPC Elimination: By exploiting DataFrame immutability and using
>>>> Plan ID invalidation for correctness, we effectively eliminated 99% of
>>>> metadata RPCs in these iterative flows.
>>>>
>>>> This essentially solves the "Shadow Schema" problem where developers
>>>> were being forced to manually track columns in local lists just to keep
>>>> their notebooks responsive.
>>>>
>>>> Updated SPIP Draft:
>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>>>
>>>> Please take a look when you have a moment. If these results look solid
>>>> to you, I’d like to move this toward a vote.
>>>>
>>>> Best regards,
>>>>
>>>> Vaquar Khan
>>>>
>>>> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>>>>
>>>>> Hi Herman,
>>>>>
>>>>> I have enabled the comments and appreciate your feedback.
>>>>>
>>>>> Regards,
>>>>> Vaquar khan
>>>>>
>>>>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Vaquar,
>>>>>>
>>>>>> Can you enable comments on the doc?
>>>>>>
>>>>>> In general I am not against making improvements in this area. However
>>>>>> the devil is very much in the details here.
>>>>>>
>>>>>> Cheers,
>>>>>> Herman
>>>>>>
>>>>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I’ve been following the rapid maturation of *Spark Connect* in the
>>>>>>> 4.x release and have been identifying areas where remote execution can
>>>>>>> reach parity with Spark Classic.
>>>>>>>
>>>>>>> While the remote execution model elegantly decouples the client from
>>>>>>> the JVM, I am concerned about a performance regression in interactive
>>>>>>> and
>>>>>>> high-complexity workloads.
>>>>>>>
>>>>>>> Specifically, the current implementation of *Eager Analysis* (
>>>>>>> df.columns, df.schema, etc.) relies on synchronous gRPC round-trips
>>>>>>> that block the client thread. In environments with high network latency,
>>>>>>> these blocking calls create a "Death by 1000 RPCs" bottleneck—often
>>>>>>> forcing developers to write suboptimal, "Connect-specific" code to
>>>>>>> avoid metadata requests.
>>>>>>>
>>>>>>> *Proposal*:
>>>>>>>
>>>>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy
>>>>>>> Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>>>>
>>>>>>> 1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return
>>>>>>>    resolved schemas of relations during standard plan execution.
>>>>>>> 2. *Local Schema Cache:* A configurable client-side cache in the
>>>>>>>    *SparkSession* to store resolved schemas.
>>>>>>> 3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol
>>>>>>>    to allow schema resolution for multiple DataFrames in a single
>>>>>>>    batch call.
>>>>>>>
>>>>>>> This shift would ensure that Spark Connect provides the same "fluid"
>>>>>>> interactive experience as Spark Classic, removing the O(N)
>>>>>>> network latency overhead for metadata-heavy operations.
>>>>>>>
>>>>>>> I have drafted a full SPIP document ready for review, which
>>>>>>> includes the proposed changes for the *SparkConnectService* and
>>>>>>> *AnalyzePlan* handlers.
>>>>>>>
>>>>>>> *SPIP Doc:*
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>>>>
>>>>>>> Before I finalize the JIRA, has there been any recent internal
>>>>>>> discussion regarding metadata prefetching or batching analysis requests
>>>>>>> in the current Spark Connect roadmap?
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vaquar Khan
>>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Vaquar Khan
>>>>>
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>>
>>>>
>>
>> --
>> Regards,
>> Vaquar Khan
>>
>>
>
> --
> Ruifeng Zheng
> E-mail: [email protected]
>
--
Regards,
Vaquar Khan