[D] August 22, 2025: Weekly Status Update in Gluten [incubator-gluten]

via GitHub Fri, 22 Aug 2025 13:18:38 -0700


GitHub user GlutenPerfBot created a discussion: August 22, 2025: Weekly Status 
Update in Gluten


*This weekly update is generated by LLMs. You're welcome to join our 
[GitHub](https://github.com/apache/incubator-gluten/discussions) for in-depth 
discussions.*

## Overall Activity Summary
This week's activity in the Gluten community focused on enhancing data lake 
capabilities, optimizing core performance, and modernizing the codebase. 
Development was concentrated on the Velox backend, with progress in Iceberg 
write support and shuffle performance. A major effort to simplify the project 
by removing legacy hardware accelerator support is also underway. The Flink 
integration gained new features, and work on Spark 4.0 compatibility continues.

## Key Ongoing Projects
Several major initiatives are pushing the project forward, driven by dedicated 
contributors:

*   **Comprehensive Iceberg Write Support:** A major push to enhance data lake 
functionality is being led by @jinchengchenghh. This includes adding support 
for partitioned writes in #10497, collecting statistics in #10495, and enabling 
copy-on-write operations in #10458.
*   **Performance Optimizations:**
    *   A significant improvement to the shuffle read process is proposed in 
#10499 by @marin-ma, which introduces a new `ColumnarShuffleReader` to merge 
input streams and improve performance for sort-based shuffles.
    *   The long-running effort to optimize Broadcast Hash Join (BHJ) 
performance continues in #8931 by @JkSelf. This highly anticipated change has 
gathered extensive community feedback and aims for major performance gains.
    *   Performance for Hive UDTFs is being improved in #10475 by 
@jiangjiangtian, which adds support for columnar partial generate to reduce 
costly row/columnar conversions.
*   **Codebase Modernization:** A large-scale refactoring is in progress to 
simplify the codebase by removing support for various hardware accelerators. 
This effort, led by @marin-ma, includes the removal of QAT support in #10489 
and IAA support in #10480.
*   **Flink Integration:** The experimental Flink backend is maturing with key 
contributions. Work to support stateful operations was merged in #10320 by 
@shuai-xu, and expanded test coverage for Nexmark was added in #10468 by 
@KevinyhZou.

## Priority Items
We encourage the community to review and provide feedback on these important 
pull requests that are currently open:

*   **Broadcast Hash Join (BHJ) Optimization:** #8931 by @JkSelf is a massive, 
long-running PR with over 170 comments. Community review is crucial to help 
validate this complex change and move it toward merging.
*   **Shuffle Reader Enhancements:** #10499 by @marin-ma proposes a significant 
architectural change to improve shuffle performance. Feedback on the design and 
implementation would be highly valuable.
*   **Removing QAT Support:** #10489 by @marin-ma is a large refactoring that 
touches many parts of the codebase. This PR needs careful review to ensure a 
smooth transition and prevent regressions.

## Notable Discussions
Several important conversations are shaping the future of Gluten:

*   A critical bug report in #10502 from @surnaik details JVM crashes on 
non-OpenJDK builds (Temurin, Azul) due to JNI exceptions. This is a major 
stability concern receiving active investigation.
*   A performance deep-dive is happening in #10214, where @ayushi-agarwal 
reported high deserialization times during shuffle reads. This discussion, with 
20 comments, provides valuable context for the improvements proposed in PR 
#10499.
*   In #10465, community member @Iskander14yo has started work on adding Gluten 
to the popular ClickBench benchmark, sparking a conversation about benchmarking 
and performance validation.

## Emerging Trends
Based on this week's activity, we've identified several key trends:

*   **Deepening Data Lake Integration:** The focus has clearly shifted from 
basic read support to comprehensive write capabilities for modern table formats 
like Iceberg, making Gluten a more complete solution for data lakehouse 
architectures.
*   **Focus on Code Health and Maintainability:** The large-scale removal of 
specialized hardware accelerator code indicates a strategic shift towards 
simplifying the codebase, reducing maintenance overhead, and focusing on 
software-based optimizations that benefit all users.
*   **Proactive Compatibility with Future Technologies:** The ongoing work to 
support Spark 4.0 (#8852) demonstrates a forward-looking approach, ensuring the 
community can seamlessly adopt new versions of Spark as they become available.

## Good First Issues
Looking to make your first contribution to Gluten? These issues are 
well-defined and a great way to get started:

*   **#4730:** Add support for the `date_from_unix_date` function in the 
ClickHouse backend.
*   **#6807:** Implement the `split_part` function for the ClickHouse backend.
*   **#6812:** Add support for the `SparkPartitionID` function in the 
ClickHouse backend.
*   **#6814:** Implement the `MakeYMInterval` expression for the ClickHouse 
backend.

These issues are excellent entry points for contributors with some C++ and 
Scala/Java experience. They involve implementing a single, well-scoped 
function, allowing you to get familiar with the codebase and contribution 
process without needing to understand the entire system. Welcome to the 
community!
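
To give a sense of how small these functions are, here is a minimal Python 
sketch of the behavior that two of them (`date_from_unix_date` and 
`split_part`) are expected to match, based on the Spark SQL built-in function 
semantics as we understand them. It is only a reference model, not Gluten or 
ClickHouse code; the Python function and parameter names are chosen for 
illustration, and NULL handling is left out.

```python
from datetime import date, timedelta


def date_from_unix_date(days: int) -> date:
    """Convert a count of days since 1970-01-01 into a calendar date."""
    return date(1970, 1, 1) + timedelta(days=days)


def split_part(s: str, delimiter: str, part_num: int) -> str:
    """Return the part_num-th piece of s split by delimiter (1-based).

    Assumed semantics, following the Spark SQL documentation:
    - part_num == 0 is an error
    - a negative part_num counts from the end
    - an out-of-range part_num yields an empty string
    - an empty delimiter leaves the string unsplit
    """
    if part_num == 0:
        raise ValueError("part_num must not be 0")
    # str.split("") raises in Python, so treat the empty delimiter explicitly.
    parts = [s] if delimiter == "" else s.split(delimiter)
    index = part_num - 1 if part_num > 0 else len(parts) + part_num
    if 0 <= index < len(parts):
        return parts[index]
    return ""
```

A backend implementation should be checked against Spark's own results for the 
same inputs, since edge cases such as NULL arguments are not covered here.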

GitHub link: https://github.com/apache/incubator-gluten/discussions/10510
