Hi Herman,

Sorry for the delay in getting back to you. I’ve finished the
comprehensive benchmarking
<https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0>
for the "*Death by 1000 RPCs*" bottleneck in Spark Connect and have
updated the SPIP draft
<https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0>
and JIRA SPARK-55163
<https://issues.apache.org/jira/browse/SPARK-55163>
("Asynchronous Metadata Resolution & Lazy Prefetching for Spark Connect")
with the findings.

As we’ve discussed, the transition to the gRPC client-server model
introduced a significant latency penalty for metadata-heavy workloads. My
research into a Client-Side Metadata Skip-Layer, using a deterministic
Plan ID strategy, shows that we can avoid the network round-trip entirely
for repeated metadata access. The performance gains exceeded our initial
projections.
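
To make the Skip-Layer concrete, here is a minimal sketch of the caching
strategy (the class and method names are illustrative for this email, not
the actual patch):

    # Minimal sketch of the Skip-Layer idea (illustrative names only):
    # resolved schemas are cached against a deterministic plan ID, which
    # is safe to do because DataFrames are immutable.
    import threading

    class SchemaCache:
        def __init__(self):
            self._lock = threading.Lock()
            self._schemas = {}  # plan_id -> resolved StructType

        def get(self, plan_id):
            # Cache hit: the blocking AnalyzePlan RPC is skipped entirely.
            with self._lock:
                return self._schemas.get(plan_id)

        def put(self, plan_id, schema):
            with self._lock:
                self._schemas[plan_id] = schema

        def invalidate(self, plan_id):
            # Correctness hook: drop an entry when something external
            # (e.g. a replaced temp view) could change the resolved schema.
            with self._lock:
                self._schemas.pop(plan_id, None)

Because every transformation produces a new DataFrame, and therefore a new
plan ID, a cached entry can only go stale through external catalog
changes, which is exactly what the invalidation hook covers.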


*Here are the key results from the testing (conducted on Spark
4.0.0-preview); a simplified sketch of the measurement loop follows the
list:*
    - Baseline Latency Confirmed: We measured a consistent 277 ms latency
for a single df.columns RPC call, with Catalyst analysis accounting for
roughly 27% of that time and network RTT/serialization for roughly 23%.

    - The Uncached Bottleneck: For a sequence of 50 metadata checks
(common in complex ETL loops or frameworks like Great Expectations), the
uncached architecture resulted in 13.2 seconds of blocking overhead.

    - Performance with Caching: With the SPARK-45123 Plan ID caching
enabled, that same 50-call sequence finished in just 0.25 seconds.

    - Speedup: This is a *51× speedup for 50 operations*, and my
projections show this scaling to a *108× speedup for 100 operations*.

    - RPC Elimination: By exploiting DataFrame immutability and using Plan
ID invalidation for correctness, we effectively eliminated 99% of metadata
RPCs in these iterative flows.
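
For reference, the 50-call numbers come from a loop of roughly this shape
(simplified here; the full harness and environment details are in the
benchmarking doc linked above, and the host/port below are placeholders):

    # Simplified shape of the 50-metadata-check benchmark.
    import time
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    df = spark.range(100)

    start = time.perf_counter()
    for i in range(50):
        df = df.withColumn(f"c{i}", F.lit(i))  # new plan each iteration
        _ = df.columns  # uncached: one blocking AnalyzePlan RPC per check
    print(f"50 metadata checks: {time.perf_counter() - start:.2f}s")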

This essentially solves the "Shadow Schema" problem, where developers
were forced to manually track columns in local lists just to keep their
notebooks responsive.
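
For context, the anti-pattern looks roughly like this (contrived example;
host/port are placeholders):

    # The "Shadow Schema" anti-pattern: mirroring the schema in a local
    # list purely to avoid metadata RPCs.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    df = spark.range(10)
    shadow_cols = ["id"]  # manually mirrored schema

    for i in range(3):
        df = df.withColumn(f"c{i}", F.lit(i))
        shadow_cols.append(f"c{i}")  # must be kept in sync by hand

    # With Plan ID caching, df.columns is cheap again and the shadow list
    # (and its drift risk) disappears.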

Updated SPIP Draft:
https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0

Please take a look when you have a moment. If these results look solid to
you, I’d like to move this toward a vote.

Best regards,

Vaquar Khan

On Wed, 7 Jan 2026 at 09:38, vaquar khan <[email protected]> wrote:

> Hi Herman,
>
> I have enabled comments and would appreciate your feedback.
>
> Regards,
> Vaquar khan
>
> On Wed, 7 Jan 2026 at 07:53, Herman van Hovell via dev <
> [email protected]> wrote:
>
>> Hi Vaquar,
>>
>> Can you enable comments on the doc?
>>
>> In general I am not against making improvements in this area. However the
>> devil is very much in the details here.
>>
>> Cheers,
>> Herman
>>
>> On Mon, Dec 29, 2025 at 1:15 PM vaquar khan <[email protected]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I’ve been following the rapid maturation of *Spark Connect* in the 4.x
>>> releases and have been identifying areas where remote execution can
>>> reach parity with Spark Classic.
>>>
>>> While the remote execution model elegantly decouples the client from the
>>> JVM, I am concerned about a performance regression in interactive and
>>> high-complexity workloads.
>>>
>>> Specifically, the current implementation of *Eager Analysis* (df.columns,
>>> df.schema, etc.) relies on synchronous gRPC round-trips that block the
>>> client thread. In environments with high network latency, these blocking
>>> calls create a "Death by 1000 RPCs" bottleneck—often forcing developers to
>>> write suboptimal, "Connect-specific" code to avoid metadata requests.
>>>
>>> *Proposal*:
>>>
>>> I propose we introduce a Client-Side Metadata Skip-Layer (Lazy
>>> Prefetching) within the Spark Connect protocol. Key pillars include:
>>>
>>>    1. *Plan-Piggybacking:* Allowing the *SparkConnectService* to
>>>    return resolved schemas of relations during standard plan execution.
>>>
>>>    2. *Local Schema Cache:* A configurable client-side cache in the
>>>    *SparkSession* to store resolved schemas.
>>>
>>>    3. *Batched Analysis API:* An extension to the *AnalyzePlan*
>>>    protocol to allow schema resolution for multiple DataFrames in a
>>>    single batch call (a rough client-side sketch follows below).
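>>>
>>> As a rough sketch, the batched API might look like this from the
>>> client side (analyze_batch is an illustrative name, not an existing
>>> API; the concrete protocol change is in the SPIP doc):
>>>
>>>     # Hypothetical batched metadata resolution (names illustrative):
>>>     df_a = spark.read.table("sales")
>>>     df_b = spark.read.table("customers")
>>>     # One AnalyzePlan round-trip resolving both schemas, instead of
>>>     # one blocking RPC per DataFrame:
>>>     schemas = analyze_batch(spark, [df_a, df_b])
>>>     print(schemas[0].fieldNames(), schemas[1].fieldNames())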
>>>
>>> This shift would ensure that Spark Connect provides the same "fluid"
>>> interactive experience as Spark Classic, removing the O(N) network
>>> latency overhead for metadata-heavy operations.
>>>
>>> I have drafted a full SPIP document ready for review, which includes
>>> the proposed changes for the *SparkConnectService* and *AnalyzePlan*
>>> handlers.
>>>
>>> *SPIP Doc:*
>>>
>>>
>>> https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?usp=sharing
>>>
>>> Before I finalize the JIRA, has there been any recent internal
>>> discussion regarding metadata prefetching or batching analysis requests in
>>> the current Spark Connect roadmap?
>>>
>>>
>>> Regards,
>>> Vaquar Khan
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>
>>
>
> --
> Regards,
> Vaquar Khan
>
>

-- 
Regards,
Vaquar Khan
