[ https://issues.apache.org/jira/browse/SPARK-55163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vaquar khan updated SPARK-55163:
--------------------------------
    Description: 
{panel}
*This SPIP proposes adding a client-side schema cache for Spark Connect 
DataFrames.*

Currently, every call to {{df.columns}} or {{df.schema}} triggers a synchronous 
gRPC analysis request to the server. While these calls are resolved locally and 
are near-instant in Spark Classic, in Connect each averages 277 ms on standard 
cloud setups (e.g., AWS t3.medium). This makes iterative work extremely slow: 
we measured a 13-second lag for 50 metadata calls in a typical ETL pipeline.
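As a sanity check, the 13-second figure follows directly from the per-call average. A minimal back-of-the-envelope sketch in Python, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the latency figures quoted above:
# each metadata access in Spark Connect averages ~277 ms, and the
# benchmarked ETL pipeline issued 50 such calls.
AVG_RPC_LATENCY_S = 0.277   # measured average per analysis round trip
METADATA_CALLS = 50         # df.columns / df.schema accesses in the pipeline

total_uncached_s = AVG_RPC_LATENCY_S * METADATA_CALLS
print(f"uncached: {total_uncached_s:.2f} s")  # ~13.85 s, the "13-second lag"

# With a client-side cache, only the first access pays the RPC:
total_cached_s = AVG_RPC_LATENCY_S * 1
print(f"cached:   {total_cached_s:.3f} s")
```

The uncached total lines up with the observed 13-second lag; the cached total lines up with the ~250 ms observed in the POC.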

This delay forces developers into a "Shadow Schema" workaround, where they 
manually track column names in local lists to avoid the RPC overhead. Since 
Spark DataFrames are immutable, we can fix this by caching the resolved schema 
on the client after the first request. Our POC shows this reduces the 13-second 
lag to about 250 ms (a 51× speedup) without breaking the core Spark Connect 
model.
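The caching idea itself is small. Below is a minimal plain-Python sketch of the intended behavior; {{FakeConnectClient}}, {{CachedDataFrame}}, and their methods are illustrative stand-ins, not the actual PySpark Connect internals:

```python
# Sketch of the proposed client-side schema cache. FakeConnectClient
# stands in for the real gRPC client; all names here are illustrative.
from typing import Optional


class FakeConnectClient:
    """Stand-in for the Spark Connect gRPC client."""
    def __init__(self):
        self.analyze_calls = 0  # counts simulated analysis round trips

    def analyze_schema(self, plan: str) -> list:
        self.analyze_calls += 1  # each call would be a ~277 ms RPC
        return ["id", "name", "amount"]  # resolved schema for `plan`


class CachedDataFrame:
    """DataFrame-like wrapper: the plan is immutable, so the resolved
    schema can be fetched once and reused on every later access."""
    def __init__(self, plan: str, client: FakeConnectClient):
        self._plan = plan
        self._client = client
        self._schema: Optional[list] = None  # populated lazily

    @property
    def schema(self) -> list:
        if self._schema is None:  # first access: pay the RPC once
            self._schema = self._client.analyze_schema(self._plan)
        return self._schema

    @property
    def columns(self) -> list:
        return list(self.schema)  # served from the cached schema


client = FakeConnectClient()
df = CachedDataFrame("SELECT * FROM sales", client)
for _ in range(50):
    _ = df.columns  # only the first iteration triggers analyze_schema
print(client.analyze_calls)  # 1
```

Because the underlying plan is immutable, the cache never needs invalidation for a given DataFrame; transformations produce new DataFrames, each starting with an empty cache.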
 
I have followed the official SPIP template for the detailed breakdown below.

*SPIP*

[https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]

*Benchmark* - 
[https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]

*Note for GSoC -* 

To set clear expectations for the GSoC timeline, and as a heads-up to the 
broader Spark developer community: because the underlying SPIP (SPARK-55163) 
is still under active discussion and has not yet received formal PMC approval, 
your GSoC project will function purely as an experimental prototype.

Your open Pull Requests will be used by mentors to evaluate your GSoC 
deliverables and milestones. However, please be aware that your code will not 
be merged into the mainline Apache Spark repository during the GSoC program. 
Successfully completing your GSoC project and passing the evaluations is tied 
to the quality of your prototype and testing, not to getting the code merged.

Your prototype will be incredibly valuable in helping the community benchmark 
the latency improvements for Spark Connect. I look forward to reviewing your 
finalized proposal!
{panel}
 

 

  was:
{panel}
*This SPIP proposes adding a client-side schema cache for Spark Connect 
DataFrames.*

Currently, every call to {{df.columns}} or {{df.schema}} triggers a synchronous 
gRPC analysis request to the server. While these are local and near-instant in 
Spark Classic, in Connect they average 277 ms on standard cloud setups (like 
AWS t3.medium). This makes iterative work extremely slow; we've measured a 
13-second lag for 50 metadata calls in a typical ETL pipeline.

This delay is forcing developers to use a "Shadow Schema" pattern, where they 
manually track column names in local lists to avoid the RPC overhead. Since 
Spark DataFrames are immutable, we can fix this by caching the resolved schema 
on the client after the first request. Our POC shows this reduces the 13-second 
lag to about 250 ms (a 51× speedup) without breaking the core Spark Connect 
model.
 
I have followed the official SPIP template for the detailed breakdown below.

*SIP*

[ 
https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0|https://docs.google.com/document/d/1xTvL5YWnHu1jfXvjlKk2KeSv8JJC08dsD7mdbjjo9YE/edit?tab=t.0]

*Benchmark* - 
[https://docs.google.com/document/d/1ebX8CtTHN3Yf3AWxg7uttzaylxBLhEv-T94svhZg_uE/edit?tab=t.0]

*Note -* 

To set clear expectations for your GSoC timeline, and as a heads-up to the 
broader Spark developer community:
Because the underlying SPIP (SPARK-55163) is still actively being discussed and 
has not yet received formal PMC approval, your GSoC project will function 
purely as an experimental prototype.

Your open Pull Requests will be used by mentors to evaluate your GSoC 
deliverables and milestones. However, please be aware that your code will not 
be merged into the mainline Apache Spark repository during the GSoC program. 
Successfully completing your GSoC project and passing the evaluations is tied 
to the quality of your prototype and testing, not to getting the code merged.

Your prototype will be incredibly valuable in helping the community benchmark 
the latency improvements for Spark Connect. I look forward to reviewing your 
finalized proposal!
{panel}
 

 


> SPIP: Client-Side Metadata Caching for Spark Connect
> ----------------------------------------------------
>
>                 Key: SPARK-55163
>                 URL: https://issues.apache.org/jira/browse/SPARK-55163
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: vaquar khan
>            Priority: Major
>              Labels: connect, gsoc2026, mentor, pull-request-available, spark
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
