Hi Vaquar, I'm Goutam K, a GSoC 2026 applicant for the Client-Side Metadata Caching for Spark Connect project (SPIP SPARK-55163).
About me: I'm a Software Engineer at Accenture, with 6 merged PRs in apache/burr. The most relevant is PR #629, where I traced iterative metadata calls through a multi-step RAG pipeline, identified where repeated round-trips were redundant, and built around that contract. I'll apply the same structured reading approach to the Spark Connect gRPC protocol and DataFrame internals.

My approach: a per-DataFrame-instance cache keyed on plan ID, backed by a WeakValueDictionary, with double-checked locking for thread safety. A cache miss triggers exactly one gRPC call; all subsequent df.schema, df.columns, and df.dtypes accesses are served locally. Derived DataFrames (df.filter, df.select) get independent cache slots automatically because they carry new plan IDs.

I'll be transparent: I haven't implemented a gRPC-level cache in PySpark before. My plan is to spend the first two weeks of community bonding writing a full plan-ID lifecycle document from the source and sharing it with you before writing any Python. I have confirmed 20–25 hrs/week of availability during the GSoC period, with no time off planned.

Two questions I'd value your input on:
- Is cache invalidation across session restarts already a concern raised in the SPIP thread, or should I raise it there?
- Are there known edge cases in how plan IDs behave for DataFrames created from SQL queries or read operations that I should study first?

Thanks for your time.

Goutam K
[email protected] | github.com/goutamk09
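P.S. To make the approach concrete, here is a minimal sketch of the cache I have in mind. All names here are my own placeholders, not actual PySpark internals; the fetch callback stands in for the single gRPC analyze call.

```python
import threading
import weakref


class _SchemaEntry:
    """Holds the cached metadata for one plan ID."""
    # __weakref__ slot is required so WeakValueDictionary can hold this type
    __slots__ = ("schema", "__weakref__")

    def __init__(self, schema):
        self.schema = schema


class SchemaCache:
    """Plan-ID-keyed metadata cache with double-checked locking.

    Entries live in a WeakValueDictionary, so an entry is dropped once no
    DataFrame holds a strong reference to it (the DataFrame is expected to
    keep the entry it receives from get()).
    """

    def __init__(self, fetch_schema):
        self._fetch = fetch_schema                    # stand-in for one gRPC call
        self._entries = weakref.WeakValueDictionary() # plan_id -> _SchemaEntry
        self._lock = threading.Lock()

    def get(self, plan_id):
        entry = self._entries.get(plan_id)            # first check, lock-free
        if entry is None:
            with self._lock:
                entry = self._entries.get(plan_id)    # second check, under lock
                if entry is None:
                    # Cache miss: exactly one fetch per live plan ID
                    entry = _SchemaEntry(self._fetch(plan_id))
                    self._entries[plan_id] = entry
        return entry
```

Because derived DataFrames carry new plan IDs, they naturally miss on first access and get their own slot; no explicit invalidation is needed for transformations.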
