Hi Erik and Herman,

Thanks for the feedback on narrowing the scope. I have updated the SPIP (
SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>) to focus
strictly on Phase 1: Client-Side Plan-ID Caching.

I spent some time looking at the pyspark.sql.connect client code and found
that while there is already a cache check in dataframe.py:1898, it is
strictly instance-bound. This explains the "Death by 1000 RPCs" bottleneck
we are seeing: every call such as .filter() or .limit() creates a new
DataFrame instance with an empty cache, forcing a fresh 277 ms AnalyzePlan
RPC even when the schema is identical to the parent's.
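
To make the failure mode concrete, here is a minimal sketch of the pattern
(the endpoint and table name are placeholders; the RPC counts reflect
today's instance-bound cache):

    from pyspark.sql import SparkSession

    # Placeholder endpoint and table; any Connect session shows the pattern.
    spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
    df = spark.read.table("events")

    _ = df.columns                  # AnalyzePlan RPC #1; cached on this instance
    filtered = df.filter("id > 0")  # new DataFrame instance, empty cache
    _ = filtered.columns            # AnalyzePlan RPC #2, although filter()
                                    # cannot change the schema
    limited = filtered.limit(10)    # another new instance, another empty cache
    _ = limited.columns             # AnalyzePlan RPC #3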

In my testing on Spark 4.0.0-preview, a sequence of 50 metadata calls on
derived DataFrames took 13.2 seconds. With the proposed Plan-ID cache, the
same sequence dropped to 0.25 seconds, a 51x speedup.
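
For reference, the 50-call sequence can be reproduced with a loop along
these lines (a sketch, not the actual benchmark harness; the endpoint and
table are placeholders, and absolute times depend on network RTT):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
    df = spark.read.table("events")

    start = time.perf_counter()
    for i in range(50):
        df = df.filter(f"id > {i}")  # schema-preserving derivation
        _ = df.columns               # today: one blocking AnalyzePlan RPC each
    print(f"50 metadata calls: {time.perf_counter() - start:.2f}s")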

By focusing only on this caching layer, we can solve the primary
performance issue with zero protocol changes and no impact on the
user-facing API. I've moved the more complex ideas, such as background
asynchronicity (which Erik noted is a "can of worms" for consistency), to a
future-work section to keep Phase 1 focused and safe.
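
To illustrate the shape (not the final API) of what Phase 1 adds, here is
a hypothetical sketch; the class name, method names, and the idea of a
gating flag are illustrative only:

    from typing import Dict, Optional
    from pyspark.sql.types import StructType

    class PlanIdSchemaCache:
        """Session-scoped map from plan ID to resolved schema.

        Logical plans are immutable, so a given plan ID's schema is stable
        for the life of the session; a client config flag gates the whole
        layer so users can opt out."""

        def __init__(self, enabled: bool = True) -> None:
            self._enabled = enabled
            self._schemas: Dict[int, StructType] = {}

        def get(self, plan_id: int) -> Optional[StructType]:
            return self._schemas.get(plan_id) if self._enabled else None

        def put(self, plan_id: int, schema: StructType) -> None:
            if self._enabled:
                self._schemas[plan_id] = schema

        def propagate(self, parent_id: int, child_id: int) -> None:
            """For schema-preserving ops (filter, limit, ...), reuse the
            parent's resolved schema for the child plan without an RPC."""
            if self._enabled and parent_id in self._schemas:
                self._schemas[child_id] = self._schemas[parent_id]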

Updated SPIP:
https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing

I would appreciate it if you could take a look at this narrowed version. Is
anyone from the PMC open to shepherding this Phase 1?

Regards,
Vaquar Khan
https://www.linkedin.com/in/vaquar-khan-b695577/

On Sun, 25 Jan 2026 at 10:53, Erik Krogen <[email protected]> wrote:

> My 2c — this seems like 3 mostly unrelated proposals that should be
> separated out. Caching of schema information in the Spark Connect client
> seems uncontroversial (as long as the behavior is controllable / gated
> behind a flag), and AFAICT, addresses your concerns.
>
> Batch resolution is interesting and I can imagine use cases, but it would
> require new APIs (AFAICT) and user logic changes, which doesn’t seem to
> solve your initial problem statement of performance degradation when
> migrating from Classic to Connect.
>
> Asynchronous resolution is a big can of worms that can fundamentally
> change the expected behavior of the APIs.
>
> I think you will have more luck if you narrowly scope this proposal to
> just client-side caching.
>
> On Fri, Jan 23, 2026 at 8:09 PM vaquar khan <[email protected]> wrote:
>
>> Hi Herman,
>>
>> Sorry for the delay in getting back to you. I’ve finished the
>> comprehensive benchmarking
>> <https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0>
>> for the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have updated
>> the SPIP draft
>> <https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0>
>> and JIRA SPARK-55163 <https://issues.apache.org/jira/browse/SPARK-55163>
>> ("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect")
>> with the findings.
>>
>> As we’ve discussed, the transition to the gRPC client-server model
>> introduced a significant latency penalty for metadata-heavy workloads. My
>> research into a Client-Side Metadata Skip-Layer, using a deterministic Plan
>> ID strategy, shows that we can bypass these physical network constraints.
>> The performance gains actually ended up exceeding our initial projections.
>>
>>
>> *Here are the key results from the testing (conducted on Spark
>> 4.0.0-preview):*
>>     - Baseline Latency Confirmed: We measured a consistent 277 ms latency
>> for a single df.columns RPC call, with Catalyst analysis accounting for
>> ~27% of that time and network RTT/serialization for another ~23%.
>>
>>     - The Uncached Bottleneck: For a sequence of 50 metadata checks, which
>> is common in complex ETL loops or frameworks like Great Expectations, the
>> uncached architecture resulted in 13.2 seconds of blocking overhead.
>>
>>     - Performance with Caching: With the SPARK-45123 Plan ID caching
>> enabled, that same 50-call sequence finished in just 0.25 seconds.
>>
>>     - Speedup: This is a *51× speedup for 50 operations*, and my
>> projections show this scaling to a *108× speedup for 100 operations*.
>>
>>     - RPC Elimination: By exploiting DataFrame immutability and using
>> Plan ID invalidation for correctness, we effectively eliminated 99% of
>> metadata RPCs in these iterative flows.
>>
>> This essentially solves the "Shadow Schema" problem where developers were
>> being forced to manually track columns in local lists just to keep their
>> notebooks responsive.
>>
>> Updated SPIP Draft:
>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0
>>
>> Please take a look when you have a moment. If these results look solid to
>> you, I’d like to move this toward a vote.
>>
>> Best regards,
>>
>> Vaquar Khan
>>
>> On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:
>>
>>> Hi Herman,
>>>
>>> I have enabled the comments and appreciate your feedback.
>>>
>>> Regards,
>>> Vaquar Khan
>>>
>>> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <
>>> [email protected]> wrote:
>>>
>>>> Hi Vaquar,
>>>>
>>>> Can you enable comments on the doc?
>>>>
>>>> In general I am not against making improvements in this area. However
>>>> the devil is very much in the details here.
>>>>
>>>> Cheers,
>>>> Herman
>>>>
>>>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I’ve been following the rapid maturation of *Spark Connect* in the
>>>>> 4.x release and have been identifying areas where remote execution can
>>>>> reach parity with Spark Classic.
>>>>>
>>>>> While the remote execution model elegantly decouples the client from
>>>>> the JVM, I am concerned about a performance regression in interactive and
>>>>> high-complexity workloads.
>>>>>
>>>>> Specifically, the current implementation of *Eager Analysis* (
>>>>> df.columns, df.schema, etc.) relies on synchronous gRPC round-trips
>>>>> that block the client thread. In environments with high network latency,
>>>>> these blocking calls create a "Death by 1000 RPCs" bottleneck, often
>>>>> forcing developers to write suboptimal, "Connect-specific" code to
>>>>> avoid metadata requests.
>>>>>
>>>>> *Proposal*:
>>>>>
>>>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy
>>>>> Prefetching) within the Spark Connect protocol. Key pillars include:
>>>>>
>>>>>    1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to
>>>>>    return resolved schemas of relations during standard plan execution.
>>>>>    2. *Local Schema Cache:* A configurable client-side cache in the
>>>>>    *SparkSession* to store resolved schemas.
>>>>>    3. *Batched Analysis API:* An extension to the *AnalyzePlan*
>>>>>    protocol to allow schema resolution for multiple DataFrames in a
>>>>>    single batch call.
>>>>>
>>>>> This shift would ensure that Spark Connect provides the same "fluid"
>>>>> interactive experience as Spark Classic, removing the O(N) network
>>>>> latency overhead for metadata-heavy operations.
>>>>>
>>>>> I have drafted a full SPIP document ready for review, which includes
>>>>> the proposed changes for the *SparkConnectService* and *AnalyzePlan*
>>>>> handlers.
>>>>>
>>>>> *SPIP Doc:*
>>>>>
>>>>>
>>>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>>>
>>>>> Before I finalize the JIRA, has there been any recent internal
>>>>> discussion regarding metadata prefetching or batching analysis requests in
>>>>> the current Spark Connect roadmap?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Vaquar Khan
>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>
>>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>>
>>>
>>
>> --
>> Regards,
>> Vaquar Khan
>>
>>

-- 
Regards,
Vaquar Khan
