GitHub user GlutenPerfBot created a discussion: December 19, 2025: Weekly Status Update in Gluten
*This weekly update is generated by LLMs. You're welcome to join our [GitHub](https://github.com/apache/incubator-gluten/discussions) for in-depth discussions.*

## Overall Activity Summary

The Gluten community merged 25 PRs and opened 9 new ones over the past 7 days. Major themes include Spark 4.0 compatibility fixes, daily Velox version bumps, infrastructure clean-ups, and new Flink connector work. Contributors are actively polishing the code base ahead of the next release.

## Key Ongoing Projects

- **Spark 4.0 compatibility** – @baibaichen, @zhztheplayer and @zhouyuan continue to burn down failing UTs (#11088); recent fixes cover Parquet IO, JSON functions, Arrow Python and Delta update commands
- **Daily Velox integration** – @GlutenPerfBot lands fresh Velox commits every day (#6887); this keeps Gluten in sync with upstream performance and function improvements
- **Flink ecosystem** – @KevinyhZou added Kafka source support (#9553, #11312) and a filesystem sink (#10064, #11300), expanding Gluten beyond Spark
- **Code-quality & build** – @xinghuayu007 introduced IWYU (#11287) and clang-tidy (#11120) checks; @PHILO-HE cleaned up legacy scripts (#11305) and directory names (#10219)
- **New backends** – @WangGuangxin posted an early "Bolt" backend PoC (#11261), a Velox fork from ByteDance

## Priority Items

- **Memory manager stability** – #11249 by @rui-mo (merged) fixes a task-level race between the Velox destructor and async I/O; critical for production
- **Z-standard compression** – #11284 by @wecharyu (merged) aligns the Velox compression level with the Spark default (3) to avoid larger Parquet files
- **Spark 4.0 test failures** – #11088 still blocks the release; the ArrowEvalPython, JsonFunctions, CSV and Hive suites need owners
- **GPU build broken** – #11302 (closed) required gcc-14 + cuda-toolkit-13.1; monitor follow-up PR #11275 by @zhouyuan
- **Parquet metadata validation** – disabled by default (#11233, #11307) until the performance regression is solved

## Notable Discussions

- #11290 (Weekly status) – community call for an EMR deployment guide and help with static/dynamic linking of libstdc++
- #11279 – @ammarchalifah asks for AWS EMR best-practice documentation; a good place to contribute docs
- #11282 – production alert on linking strategy; maintainers are weighing static vs dynamic libgcc/libstdc++

## Emerging Trends

1. Multi-engine support: the Flink Kafka source and filesystem sink show Gluten positioning itself as a universal native accelerator
2. Release readiness: a flurry of minor clean-ups (TPP.txt removal, script typos, dead config flags) signals prep for a stable branch
3. Backend diversification: the Omni and Bolt experiments indicate demand for vendor-specific optimizations
4. Memory & stability focus: recent fixes for OOM-GC interaction, broadcast spill, and memory-manager teardown highlight production hardening

## Good First Issues

- #6814 – Implement MakeYMInterval expression for ClickHouse; straightforward function mapping
- #4730 – Add date_from_unix_date function for ClickHouse; already prototyped in #10026 by @soupam05 and awaiting review
- #6807 – Support split_part function for ClickHouse; string manipulation with a clear spec
- #6812 – Add SparkPartitionID function for ClickHouse; useful for partition-aware queries
- #6815 – Support MapZipWith expression for ClickHouse; slightly more advanced but well-documented

All issues above are self-contained, require basic C++ and ClickHouse knowledge, and come with existing examples in the codebase, which makes them a good fit for first-time contributors who want to learn Gluten's function registration flow.

GitHub link: https://github.com/apache/incubator-gluten/discussions/11315
