[jira] [Updated] (SPARK-55163) SPIP: Client-Side Metadata Caching for Spark Connect

vaquar khan (Jira) Fri, 23 Jan 2026 19:51:07 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-55163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


vaquar khan updated SPARK-55163:
--------------------------------
    Description: 
{panel}
*This SPIP proposes adding a client-side schema cache for Spark Connect 
DataFrames.*

Currently, every call to {{df.columns}} or {{df.schema}} triggers a synchronous 
gRPC analysis request to the server. While these are local and near-instant in 
Spark Classic, in Connect they average 277 ms on standard cloud setups (like 
AWS t3.medium). This makes iterative work extremely slow; we've measured a 
13-second lag for 50 metadata calls in a typical ETL pipeline.

This delay is forcing developers to use a "Shadow Schema" pattern, where they 
manually track column names in local lists to avoid the RPC overhead. Since 
Spark DataFrames are immutable, we can fix this by caching the resolved schema 
on the client after the first request. Our POC shows this reduces the 13-second 
lag to about 250 ms (a 51× speedup) without breaking the core Spark Connect 
model.
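
The caching idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the POC code and not the real Spark Connect client: {{SchemaCachingFrame}}, {{FakeServer}}, and the {{analyze}} callable are hypothetical names, and the fake server simply counts requests in place of a gRPC analysis call.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a resolved schema: (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


class SchemaCachingFrame:
    """Illustrative wrapper; not the actual Spark Connect DataFrame class."""

    def __init__(self, plan: str, analyze: Callable[[str], Schema]) -> None:
        self._plan = plan        # immutable description of the logical plan
        self._analyze = analyze  # stands in for the gRPC analysis request
        self._schema: Optional[Schema] = None

    @property
    def schema(self) -> Schema:
        # The first access pays for one round trip; later accesses are local.
        if self._schema is None:
            self._schema = self._analyze(self._plan)
        return self._schema

    @property
    def columns(self) -> List[str]:
        return [name for name, _ in self.schema]

    def select(self, *cols: str) -> "SchemaCachingFrame":
        # A transformation yields a new frame with a new plan, so it starts
        # with an empty cache instead of reusing the parent's schema.
        return SchemaCachingFrame(f"{self._plan}.select{cols}", self._analyze)


class FakeServer:
    """Counts analysis requests in place of a real Spark Connect server."""

    def __init__(self) -> None:
        self.calls = 0

    def analyze(self, plan: str) -> Schema:
        self.calls += 1
        return (("id", "bigint"), ("name", "string"))


server = FakeServer()
df = SchemaCachingFrame("read.table('t')", server.analyze)
for _ in range(50):
    df.columns  # 50 metadata calls, as in the scenario described above
print(server.calls)
```

Because each frame is immutable, the cached value never needs invalidation within that frame; a derived frame resolves its own schema on first access.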
 
I have followed the official SPIP template for the detailed breakdown below.

*SPIP*

[https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]

*Benchmark* - 
[https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]
{panel}
 

  was:
{panel}
This SPIP proposes adding a client-side schema cache for Spark Connect 
DataFrames.

Currently, every call to {{df.columns}} or {{df.schema}} triggers a synchronous 
gRPC analysis request to the server. While these are local and near-instant in 
Spark Classic, in Connect they average 277 ms on standard cloud setups (like 
AWS t3.medium). This makes iterative work extremely slow; we've measured a 
13-second lag for 50 metadata calls in a typical ETL pipeline.

This delay is forcing developers to use a "Shadow Schema" pattern, where they 
manually track column names in local lists to avoid the RPC overhead. Since 
Spark DataFrames are immutable, we can fix this by caching the resolved schema 
on the client after the first request. Our POC shows this reduces the 13-second 
lag to about 250 ms (a 51× speedup) without breaking the core Spark Connect 
model.
 
I have followed the official SPIP template for the detailed breakdown below.

*SIP*

[ 
https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0|https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]

*Benchmark* - 
https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0
{panel}
 


> SPIP: Client-Side Metadata Caching for Spark Connect
> ----------------------------------------------------
>
>                 Key: SPARK-55163
>                 URL: https://issues.apache.org/jira/browse/SPARK-55163
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: vaquar khan
>            Priority: Major
>              Labels: connect, spark
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
