Hi Vaquar,

> every time a user does something like .filter() or .limit(), it creates a
new DataFrame instance with an empty cache. This forces a fresh 277 ms
AnalyzePlan RPC even if the schema is exactly the same as the parent.

Is this true? I think the cached schema is already propagated in operators
like `filter` and `limit`, see

https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L560-L567
https://github.com/apache/spark/blob/43529889f24011c3df1d308e8b673967818b7c33/python/pyspark/sql/connect/dataframe.py#L792-L795
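
For reference, here is the kind of check that should settle it (a rough
sketch, assuming a Spark Connect server reachable at sc://localhost:15002;
if the cached schema is propagated, the second access should not trigger a
new AnalyzePlan RPC):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
    _ = df.schema                   # first access: one AnalyzePlan RPC

    child = df.filter(df.id > 3).limit(5)
    _ = child.schema                # should be served from the propagated cache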


On Sun, Feb 8, 2026 at 4:44 AM vaquar khan <[email protected]> wrote:

> Hi Erik and Herman,
>
> Thanks for the feedback on narrowing the scope. I have updated the SPIP (
> SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>) to focus
> strictly on Phase 1: Client-Side Plan-ID Caching.
>
> I spent some time looking at the pyspark.sql.connect client code and found
> that while there is already a cache check in dataframe.py:1898, it is
> strictly instance-bound. This explains the "Death by 1000 RPCs" bottleneck
> we are seeing: every time a user does something like .filter() or .limit(),
> it creates a new DataFrame instance with an empty cache. This forces a
> fresh 277 ms AnalyzePlan RPC even if the schema is exactly the same as the
> parent.
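>
> Concretely, the pattern I am describing is along these lines (illustrative
> only; the table name is hypothetical):
>
>     base = spark.read.table("events")
>     _ = base.columns     # AnalyzePlan RPC for the base DataFrame
>
>     step1 = base.filter("value > 0")
>     _ = step1.columns    # with an instance-bound cache, this would be a new RPC
>
>     step2 = step1.limit(100)
>     _ = step2.columns    # and another one here, for the same schema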
>
> In my testing on Spark 4.0.0-preview, a sequence of 50 metadata calls on
> derived DataFrames took 13.2 seconds. With the proposed Plan-ID cache, that
> same sequence dropped to 0.25 seconds—a 51x speedup.
>
> By focusing only on this caching layer, we can solve the primary
> performance issue with zero protocol changes and no impact on the
> user-facing API. I've moved the more complex ideas like background
> asynchronicity—which Erik noted as a "can of worms" regarding
> consistency—to a future work section to keep this Phase 1 focused and safe.
>
> Updated SPIP:
> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>
> I would appreciate it if you could take a look at this narrowed version.
> Is anyone from the PMC open to shepherding this Phase 1?
>
> Regards,
> Vaquar Khan
> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> On Sun, 25 Jan 2026 at 10:53, Erik Krogen <[email protected]> wrote:
>
>> My 2c — this seems like 3 mostly unrelated proposals that should be
>> separated out. Caching of schema information in the Spark Connect client
>> seems uncontroversial (as long as the behavior is controllable / gated
>> behind a flag), and AFAICT, addresses your concerns.
>>
>> Batch resolution is interesting and I can imagine use cases, but it would
>> require new APIs (AFAICT) and user logic changes, which doesn’t seem to
>> solve your initial problem statement of performance degradation when
>> migrating from Classic to Connect.
>>
>> Asynchronous resolution is a big can of worms that can fundamentally
>> change the expected behavior of the APIs.
>>
>> I think you will have more luck if you narrowly scope this proposal to
>> just client-side caching.
>>
>> On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]>
>> wrote:
>>
>>> Hi Herman,
>>>
>>> Sorry for the delay in getting back to you. I’ve finished the
>>> comprehensive benchmarking
>>> <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0> for
>>> the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have updated
>>> the SPIP draft
>>> <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0>
>>> and JIRA SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>
>>> ("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect")
>>> with the findings.
>>>
>>> As we’ve discussed, the transition to the gRPC client-server model
>>> introduced a significant latency penalty for metadata-heavy workloads. My
>>> research into a Client-Side Metadata Skip-Layer, using a deterministic Plan
>>> ID strategy, shows that we can bypass these physical network constraints.
>>> The performance gains actually ended up exceeding our initial projections.
>>>
>>>
>>> *Here are the key results from the testing (conducted on Spark
>>> 4.0.0-preview):*
>>>     - Baseline Latency Confirmed: We measured a consistent 277 ms
>>> latency for a single df.columns RPC call. Our analysis attributes roughly
>>> 27% of that to Catalyst analysis and about 23% to network RTT and
>>> serialization.
>>>
>>>     - The Uncached Bottleneck: For a sequence of 50 metadata checks—which
>>> is common in complex ETL loops or frameworks like Great Expectations—the
>>> uncached architecture resulted in 13.2 seconds of blocking overhead.
>>>
>>>     - Performance with Caching: With the SPARK-45123 Plan ID caching
>>> enabled, that same 50-call sequence finished in just 0.25 seconds.
>>>
>>>     - Speedup: This is a *51× speedup for 50 operations*, and my
>>> projections show this scaling to a *108× speedup for 100 operations*.
>>>
>>>     - RPC Elimination: By exploiting DataFrame immutability and using
>>> Plan ID invalidation for correctness, we effectively eliminated 99% of
>>> metadata RPCs in these iterative flows.
>>>
>>> This essentially solves the "Shadow Schema" problem where developers
>>> were being forced to manually track columns in local lists just to keep
>>> their notebooks responsive.
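>>>
>>> For context, the "Shadow Schema" workaround I mean looks roughly like
>>> this (a made-up example, not code from any particular project):
>>>
>>>     # Anti-pattern: hand-maintained column list to avoid df.columns RPCs
>>>     known_columns = ["id", "amount", "country"]
>>>     df = spark.read.table("sales")
>>>     df = df.withColumn("amount_usd", df["amount"] * 1.1)
>>>     known_columns.append("amount_usd")   # must be kept in sync manually
>>>     # later code consults known_columns instead of df.columns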
>>>
>>> Updated SPIP Draft:
>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>>
>>> Please take a look when you have a moment. If these results look solid
>>> to you, I’d like to move this toward a vote.
>>>
>>> Best regards,
>>>
>>> Vaquar Khan
>>>
>>> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>>>
>>>> Hi Herman,
>>>>
>>>> I have enabled comments on the doc and would appreciate your feedback.
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Vaquar,
>>>>>
>>>>> Can you enable comments on the doc?
>>>>>
>>>>> In general I am not against making improvements in this area. However
>>>>> the devil is very much in the details here.
>>>>>
>>>>> Cheers,
>>>>> Herman
>>>>>
>>>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I’ve been following the rapid maturation of *Spark Connect* in the
>>>>>> 4.x release and have been identifying areas where remote execution can
>>>>>> reach parity with Spark Classic.
>>>>>>
>>>>>> While the remote execution model elegantly decouples the client from
>>>>>> the JVM, I am concerned about a performance regression in interactive and
>>>>>> high-complexity workloads.
>>>>>>
>>>>>> Specifically, the current implementation of *Eager Analysis* (
>>>>>> df.columns, df.schema, etc.) relies on synchronous gRPC round-trips
>>>>>> that block the client thread. In environments with high network latency,
>>>>>> these blocking calls create a "Death by 1000 RPCs" bottleneck—often
>>>>>> forcing developers to write suboptimal, "Connect-specific" code to avoid
>>>>>> metadata requests.
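>>>>>>
>>>>>> To make the behavior concrete (a minimal sketch; the remote address and
>>>>>> table name are placeholders):
>>>>>>
>>>>>>     from pyspark.sql import SparkSession
>>>>>>
>>>>>>     # In Spark Classic, df.columns is answered by the local driver JVM.
>>>>>>     # In Spark Connect, the same call issues a synchronous AnalyzePlan
>>>>>>     # RPC and blocks the client thread until the server responds.
>>>>>>     spark = SparkSession.builder.remote("sc://remote-host:15002").getOrCreate()
>>>>>>     df = spark.read.table("events")
>>>>>>     cols = df.columns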
>>>>>>
>>>>>> *Proposal*:
>>>>>>
>>>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy
>>>>>> Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>>>
>>>>>>    1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to return
>>>>>>    resolved schemas of relations during standard plan execution.
>>>>>>
>>>>>>    2. *Local Schema Cache:* A configurable client-side cache in the
>>>>>>    *SparkSession* to store resolved schemas (see the sketch after this
>>>>>>    list).
>>>>>>
>>>>>>    3. *Batched Analysis API:* An extension to the *AnalyzePlan* protocol
>>>>>>    to allow schema resolution for multiple DataFrames in a single batch
>>>>>>    call.
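>>>>>>
>>>>>> As a rough illustration of pillar 2 (names here are hypothetical, not an
>>>>>> existing API), the client-side cache could be little more than a mapping
>>>>>> on the session from a relation's plan ID to its resolved schema:
>>>>>>
>>>>>>     from typing import Dict, Optional
>>>>>>     from pyspark.sql.types import StructType
>>>>>>
>>>>>>     class SchemaCache:
>>>>>>         """Hypothetical session-level cache: plan ID -> resolved schema."""
>>>>>>
>>>>>>         def __init__(self) -> None:
>>>>>>             self._schemas: Dict[int, StructType] = {}
>>>>>>
>>>>>>         def get(self, plan_id: int) -> Optional[StructType]:
>>>>>>             return self._schemas.get(plan_id)
>>>>>>
>>>>>>         def put(self, plan_id: int, schema: StructType) -> None:
>>>>>>             self._schemas[plan_id] = schema
>>>>>>
>>>>>> The DataFrame would consult this cache before falling back to an
>>>>>> AnalyzePlan RPC, and entries would simply never be reused once the plan
>>>>>> ID changes.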
>>>>>>
>>>>>> This shift would ensure that Spark Connect provides the same "fluid"
>>>>>> interactive experience as Spark Classic, removing the O(N) network
>>>>>> latency overhead for metadata-heavy operations.
>>>>>>
>>>>>> I have drafted a full SPIP document ready for review, which
>>>>>> includes the proposed changes for the *SparkConnectService* and
>>>>>> *AnalyzePlan* handlers.
>>>>>>
>>>>>> *SPIP Doc:*
>>>>>>
>>>>>>
>>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>>>
>>>>>> Before I finalize the JIRA, has there been any recent internal
>>>>>> discussion regarding metadata prefetching or batching analysis requests 
>>>>>> in
>>>>>> the current Spark Connect roadmap?
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Vaquar Khan
>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>>
>>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>>
>>>
>
> --
> Regards,
> Vaquar Khan
>
>

-- 
Ruifeng Zheng
E-mail: [email protected]
