[ https://issues.apache.org/jira/browse/SPARK-55163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
vaquar khan updated SPARK-55163:
--------------------------------
Description:
{panel}
*This SPIP proposes adding a client-side schema cache for Spark Connect
DataFrames.*
Currently, every call to {{df.columns}} or {{df.schema}} triggers a synchronous
gRPC analysis request to the server. In Spark Classic these calls are local and
near-instant, but in Connect they average 277 ms on standard cloud setups (like
AWS t3.medium). This makes iterative work extremely slow; we've measured about
13 seconds of cumulative latency for 50 metadata calls in a typical ETL
pipeline.
This delay pushes developers toward a "Shadow Schema" pattern, where they
manually track column names in local lists to avoid the RPC overhead. Since
Spark DataFrames are immutable, we can fix this by caching the resolved schema
on the client after the first request. Our POC shows this reduces the 13-second
lag to about 250 ms (a 51× speedup) without breaking the core Spark Connect
model.
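To illustrate the caching idea only, here is a minimal Python sketch. It is not the POC code and does not use the real Spark Connect client: {{CachedSchemaFrame}}, {{FakeServer}}, and the {{analyze}} callback are hypothetical names standing in for a DataFrame handle and the gRPC analysis call. The sketch assumes the schema of a given plan does not change, so it can be resolved once per DataFrame object.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Schema = List[Tuple[str, str]]  # (column name, type name) pairs


class FakeServer:
    """Hypothetical stand-in for the server; counts analysis round trips."""

    def __init__(self) -> None:
        self.calls = 0

    def analyze(self, plan: str) -> Schema:
        # In Spark Connect this would be a synchronous gRPC request.
        # The fake returns a fixed schema regardless of the plan.
        self.calls += 1
        return [("id", "bigint"), ("name", "string")]


@dataclass
class CachedSchemaFrame:
    """Hypothetical DataFrame handle that resolves its schema at most once."""

    plan: str
    analyze: Callable[[str], Schema]
    _schema: Optional[Schema] = field(default=None, init=False, repr=False)

    @property
    def schema(self) -> Schema:
        # First access pays for the round trip; later accesses are local.
        if self._schema is None:
            self._schema = self.analyze(self.plan)
        return self._schema

    @property
    def columns(self) -> List[str]:
        return [name for name, _ in self.schema]

    def select(self, *cols: str) -> "CachedSchemaFrame":
        # A transformation produces a new plan, so the new object
        # starts with an empty cache of its own.
        new_plan = f"{self.plan}.select({', '.join(cols)})"
        return CachedSchemaFrame(new_plan, self.analyze)


server = FakeServer()
df = CachedSchemaFrame("read.table(events)", server.analyze)

# 50 metadata calls against the same DataFrame object.
for _ in range(50):
    df.columns

print(server.calls)
```

In this sketch the 50 metadata calls cost a single simulated round trip, and a derived DataFrame resolves its own schema separately, so no cache entry ever has to be invalidated.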
I have followed the official SPIP template for the detailed breakdown below.
*SPIP* -
[https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]
*Benchmark* -
[https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]
{panel}
> SPIP: Client-Side Metadata Caching for Spark Connect
> ----------------------------------------------------
>
> Key: SPARK-55163
> URL: https://issues.apache.org/jira/browse/SPARK-55163
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: vaquar khan
> Priority: Major
> Labels: connect, spark