Hi Vaquar, I'm Goutam K, a GSoC 2026 applicant for the Client-Side Metadata Caching for Spark Connect project (SPIP SPARK-55163).
About me: I'm a Software Engineer at Accenture, with 6 merged PRs in apache/burr. The most relevant is PR #629, where I traced iterative metadata calls through a multi-step RAG pipeline, identified where repeated round-trips were redundant, and built around that contract. I'll apply the same structured reading approach to the Spark Connect gRPC protocol and DataFrame internals.

My approach: a per-DataFrame-instance cache keyed on plan ID, backed by a WeakValueDictionary, with double-checked locking for thread safety. A cache miss triggers exactly one gRPC call; all subsequent df.schema, df.columns, and df.dtypes accesses are served locally. Derived DataFrames (df.filter, df.select) get independent cache slots automatically because they carry new plan IDs.

I'll be transparent: I haven't implemented a gRPC-level cache in PySpark before. My plan is to spend the first two weeks of community bonding writing a full plan-ID lifecycle document from the source and sharing it with you before writing any Python. I have confirmed 20–25 hrs/week of availability during the GSoC period, with no time off planned.

Two questions I'd value your input on:
- Is cache invalidation across session restarts already a concern raised in the SPIP thread, or should I raise it there?
- Are there known edge cases in how plan IDs behave for DataFrames created from SQL queries or read operations that I should study first?

Thanks for your time.

Goutam K
[email protected] | github.com/goutamk09
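P.S. To make the approach concrete, here is a minimal sketch of the cache I have in mind. All names here are my own placeholders, not actual PySpark internals; the fetch callback stands in for the single gRPC analyze call.

```python
import threading
import weakref


class _SchemaEntry:
    """Holds the cached metadata for one plan ID."""
    # __weakref__ slot is required so WeakValueDictionary can hold this type
    __slots__ = ("schema", "__weakref__")

    def __init__(self, schema):
        self.schema = schema


class SchemaCache:
    """Plan-ID-keyed metadata cache with double-checked locking.

    Entries live in a WeakValueDictionary, so an entry is dropped once no
    DataFrame holds a strong reference to it (the DataFrame is expected to
    keep the entry it receives from get()).
    """

    def __init__(self, fetch_schema):
        self._fetch = fetch_schema                    # stand-in for one gRPC call
        self._entries = weakref.WeakValueDictionary() # plan_id -> _SchemaEntry
        self._lock = threading.Lock()

    def get(self, plan_id):
        entry = self._entries.get(plan_id)            # first check, lock-free
        if entry is None:
            with self._lock:
                entry = self._entries.get(plan_id)    # second check, under lock
                if entry is None:
                    # Cache miss: exactly one fetch per live plan ID
                    entry = _SchemaEntry(self._fetch(plan_id))
                    self._entries[plan_id] = entry
        return entry
```

Because derived DataFrames carry new plan IDs, they naturally miss on first access and get their own slot; no explicit invalidation is needed for transformations.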
