GitHub user GlutenPerfBot created a discussion: August 22, 2025: Weekly Status Update in Gluten
*This weekly update is generated by LLMs. You're welcome to join our [GitHub](https://github.com/apache/incubator-gluten/discussions) for in-depth discussions.*

## Overall Activity Summary

This week saw vibrant activity in the Gluten community, with a strong focus on enhancing data lake capabilities, optimizing core performance, and modernizing the codebase. Development was heavily concentrated on the Velox backend, with significant progress in Iceberg write support and shuffle performance. Additionally, a major effort to simplify the project by removing legacy hardware accelerator support is underway. The Flink integration continues to mature with new features, and proactive work on Spark 4.0 compatibility ensures the project stays future-ready.

## Key Ongoing Projects

Several major initiatives are pushing the project forward, driven by dedicated contributors:

* **Comprehensive Iceberg Write Support:** A major push to enhance data lake functionality is being led by @jinchengchenghh. This includes adding support for partitioned writes in #10497, collecting statistics in #10495, and enabling copy-on-write operations in #10458.
* **Performance Optimizations:**
  * A significant improvement to the shuffle read process is proposed in #10499 by @marin-ma, which introduces a new `ColumnarShuffleReader` to merge input streams and improve performance for sort-based shuffles.
  * The long-running effort to optimize Broadcast Hash Join (BHJ) performance continues in #8931 by @JkSelf. This highly anticipated change has gathered extensive community feedback and aims for major performance gains.
  * Performance for Hive UDTFs is being improved in #10475 by @jiangjiangtian, which adds support for columnar partial generate to reduce costly row/columnar conversions.
* **Codebase Modernization:** A large-scale refactoring is in progress to simplify the codebase by removing support for various hardware accelerators. This effort, led by @marin-ma, includes the removal of QAT support in #10489 and IAA support in #10480.
* **Flink Integration:** The experimental Flink backend is maturing with key contributions. Work to support stateful operations was merged in #10320 by @shuai-xu, and expanded test coverage for nexmark was added in #10468 by @KevinyhZou.

## Priority Items

We encourage the community to review and provide feedback on these important pull requests that are currently open:

* **Broadcast Hash Join (BHJ) Optimization:** #8931 by @JkSelf is a massive, long-running PR with over 170 comments. Community review is crucial to help validate this complex change and move it toward merging.
* **Shuffle Reader Enhancements:** #10499 by @marin-ma proposes a significant architectural change to improve shuffle performance. Feedback on the design and implementation would be highly valuable.
* **Removing QAT Support:** #10489 by @marin-ma is a large refactoring that touches many parts of the codebase. This PR needs careful review to ensure a smooth transition and prevent regressions.

## Notable Discussions

Several important conversations are shaping the future of Gluten:

* A critical bug report in #10502 from @surnaik details JVM crashes on non-OpenJDK builds (Temurin, Azul) due to JNI exceptions. This is a major stability concern receiving active investigation.
* A performance deep-dive is happening in #10214, where @ayushi-agarwal reported high deserialization times during shuffle reads. This discussion, with 20 comments, provides valuable context for the improvements proposed in PR #10499.
* In #10465, community member @Iskander14yo has started work on adding Gluten to the popular ClickBench benchmark, sparking a conversation about benchmarking and performance validation.

## Emerging Trends

Based on this week's activity, we've identified several key trends:

* **Deepening Data Lake Integration:** The focus has clearly shifted from basic read support to comprehensive write capabilities for modern table formats like Iceberg, making Gluten a more complete solution for data lakehouse architectures.
* **Focus on Code Health and Maintainability:** The large-scale removal of specialized hardware accelerator code indicates a strategic shift towards simplifying the codebase, reducing maintenance overhead, and focusing on software-based optimizations that benefit all users.
* **Proactive Compatibility with Future Technologies:** The ongoing work to support Spark 4.0 (#8852) demonstrates a forward-looking approach, ensuring the community can seamlessly adopt new versions of Spark as they become available.

## Good First Issues

Looking to make your first contribution to Gluten? These issues are well-defined and a great way to get started:

* **#4730:** Add support for the `date_from_unix_date` function in the ClickHouse backend.
* **#6807:** Implement the `split_part` function for the ClickHouse backend.
* **#6812:** Add support for the `SparkPartitionID` function in the ClickHouse backend.
* **#6814:** Implement the `MakeYMInterval` expression for the ClickHouse backend.

These issues are excellent entry points for contributors with some C++ and Scala/Java experience. They involve implementing a single, well-scoped function, allowing you to get familiar with the codebase and contribution process without needing to understand the entire system.

Welcome to the community

GitHub link: https://github.com/apache/incubator-gluten/discussions/10510
