[
https://issues.apache.org/jira/browse/SPARK-55163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
vaquar khan updated SPARK-55163:
--------------------------------
Labels: connect gsoc2026 mentor spark (was: connect spark)
> SPIP: Client-Side Metadata Caching for Spark Connect
> ----------------------------------------------------
>
> Key: SPARK-55163
> URL: https://issues.apache.org/jira/browse/SPARK-55163
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: vaquar khan
> Priority: Major
> Labels: connect, gsoc2026, mentor, spark
>
> {panel}
> *This SPIP proposes adding a client-side schema cache for Spark Connect
> DataFrames.*
> Currently, every call to {{df.columns}} or {{df.schema}} triggers a
> synchronous gRPC analysis request to the server. In Spark Classic these
> calls are local and near-instant; in Connect they average 277 ms on standard
> cloud setups (such as AWS t3.medium). This makes iterative work extremely
> slow: we measured about 13 seconds of cumulative latency for 50 metadata
> calls in a typical ETL pipeline.
> This delay is forcing developers to use a "Shadow Schema" pattern, where they
> manually track column names in local lists to avoid the RPC overhead. Since
> Spark DataFrames are immutable, we can fix this by caching the resolved
> schema on the client after the first request. Our POC shows this reduces the
> 13-second lag to about 250 ms (a 51× speedup) without breaking the core Spark
> Connect model.
>
> I have followed the official SPIP template for the detailed breakdown below.
> *SPIP* -
> [https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]
> *Benchmark* -
> [https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]
> {panel}
>
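The caching idea in the description above can be illustrated with a minimal sketch. It does not use PySpark or the real Spark Connect client: `CachedSchemaFrame`, `FakeServer`, and the tuple-based `Schema` type are hypothetical names invented for this example, and `FakeServer.analyze` merely stands in for the gRPC analysis request. The sketch only shows that, once the schema is memoized on an immutable frame object, repeated `columns` lookups cost a single round trip.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a resolved schema: (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


class CachedSchemaFrame:
    """Hypothetical wrapper that resolves the schema once and then serves it locally."""

    def __init__(self, analyze: Callable[[], Schema]) -> None:
        self._analyze = analyze  # assumption: stands in for the gRPC analysis call
        self._schema: Optional[Schema] = None

    @property
    def schema(self) -> Schema:
        # Only the first access pays for the round trip; later ones hit the cache.
        if self._schema is None:
            self._schema = self._analyze()
        return self._schema

    @property
    def columns(self) -> List[str]:
        return [name for name, _ in self.schema]


class FakeServer:
    """Counts analysis requests so the saving is observable."""

    def __init__(self) -> None:
        self.calls = 0

    def analyze(self) -> Schema:
        self.calls += 1
        return (("id", "bigint"), ("name", "string"))


server = FakeServer()
df = CachedSchemaFrame(server.analyze)
for _ in range(50):
    df.columns  # 50 metadata lookups, as in the scenario described above
print(server.calls)  # prints 1
```

Because a transformation would produce a new frame object with its own empty cache, a cached schema in this sketch never needs invalidating; how the actual proposal handles this is specified in the linked SPIP document, not here.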