[ 
https://issues.apache.org/jira/browse/SPARK-55163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vaquar khan updated SPARK-55163:
--------------------------------
    Labels: connect gsoc2026 mentor spark  (was: connect spark)

> SPIP: Client-Side Metadata Caching for Spark Connect
> ----------------------------------------------------
>
>                 Key: SPARK-55163
>                 URL: https://issues.apache.org/jira/browse/SPARK-55163
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: vaquar khan
>            Priority: Major
>              Labels: connect, gsoc2026, mentor, spark
>
> {panel}
> *This SPIP proposes adding a client-side schema cache for Spark Connect 
> DataFrames.*
> Currently, every call to {{df.columns}} or {{df.schema}} triggers a 
> synchronous gRPC analysis request to the server. While these are local and 
> near-instant in Spark Classic, in Connect they average 277 ms on standard 
> cloud setups (like AWS t3.medium). This makes iterative work extremely slow; 
> we've measured a 13-second lag for 50 metadata calls in a typical ETL 
> pipeline.
> This delay is forcing developers to use a "Shadow Schema" pattern, where they 
> manually track column names in local lists to avoid the RPC overhead. Since 
> Spark DataFrames are immutable, we can fix this by caching the resolved 
> schema on the client after the first request. Our POC shows this reduces the 
> 13-second lag to about 250 ms (a 51× speedup) without breaking the core Spark 
> Connect model.
>  
> I have followed the official SPIP template for the detailed breakdown below.
> *SPIP* - 
> [https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]
> *Benchmark* - 
> [https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]
> {panel}
>  
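
A minimal sketch of the caching idea described in the quoted proposal, written in plain Python with no Spark dependency. The names CachedSchemaFrame, FakeServer, and analyze_plan are hypothetical and exist only for illustration; they are not part of the Spark Connect API or of the linked POC. The fake server simply counts calls, standing in for the synchronous gRPC analysis request.

```python
from typing import Callable, List, Optional, Tuple

# A schema is modeled here as (column name, type name) pairs. This is an
# assumption for the sketch, not Spark's StructType.
Schema = Tuple[Tuple[str, str], ...]


class FakeServer:
    """Hypothetical stand-in for the Connect server; counts analysis calls."""

    def __init__(self) -> None:
        self.rpc_count = 0

    def analyze_plan(self) -> Schema:
        # In real Spark Connect this would be a network round trip.
        self.rpc_count += 1
        return (("id", "bigint"), ("name", "string"))


class CachedSchemaFrame:
    """Hypothetical immutable frame that resolves its schema at most once."""

    def __init__(self, analyze: Callable[[], Schema]) -> None:
        self._analyze = analyze
        self._schema: Optional[Schema] = None  # filled on first access

    @property
    def schema(self) -> Schema:
        # Only the first access pays for the (simulated) RPC.
        if self._schema is None:
            self._schema = self._analyze()
        return self._schema

    @property
    def columns(self) -> List[str]:
        # Derived locally from the cached schema; no extra request.
        return [name for name, _ in self.schema]


server = FakeServer()
df = CachedSchemaFrame(server.analyze_plan)

# 50 metadata calls, as in the ETL scenario described above.
for _ in range(50):
    _ = df.columns

print(server.rpc_count)  # prints 1
```

Because the frame is treated as immutable, the cached schema is never invalidated in this sketch. It does not model server-side changes to the underlying data source.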



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
