Xtpacz opened a new pull request, #12259: URL: https://github.com/apache/gluten/pull/12259
Related issue: https://github.com/apache/gluten/issues/12251

## What changes are proposed in this pull request?

This PR adds a new config, `spark.gluten.velox.broadcastBuild.mergeBatches` (default `false`), that controls how columnar batches are serialized during the broadcast build for broadcast hash join (BHJ). When enabled, all batches on the build side are serialized into a **single** buffer through a new `serializeAll` JNI entry point, so the executor-side `HashBuild` operator receives **one** `addInput` call instead of N. For broadcast tables that fan out into many small batches (e.g. when `spark.gluten.sql.columnar.maxBatchSize` is small or the build side is narrow), this materially reduces per-batch overhead.

## How was this patch tested?

Verified on an internal Spark cluster running TPC-DS 5TB. On q64, with all other configs equal, the aggregated **HashBuild total time** dropped from **2.04h to 22.1min** when `mergeBatches=true` (~5.5x reduction). Result correctness was verified by comparing query output between `mergeBatches=true` and `mergeBatches=false`.

## Was this patch authored or co-authored using generative AI tooling?

Co-authored using Claude (claude-opus-4.7)
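
For reviewers who want a quick mental model of the `mergeBatches` behavior described under "What changes are proposed", here is a minimal Python sketch. It is not Gluten or Velox code: `HashBuildSketch`, `add_input`, and `broadcast_build` are hypothetical stand-ins, and plain lists of integers stand in for columnar batches. It only illustrates how merging the build side changes the number of `addInput`-style calls from N to one while leaving the rows unchanged.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HashBuildSketch:
    """Hypothetical stand-in for the executor-side HashBuild operator."""
    add_input_calls: int = 0
    rows: List[int] = field(default_factory=list)

    def add_input(self, batch: List[int]) -> None:
        # Each call models the fixed per-batch cost the PR aims to avoid.
        self.add_input_calls += 1
        self.rows.extend(batch)


def broadcast_build(batches: List[List[int]], merge_batches: bool) -> HashBuildSketch:
    """Feed build-side batches to the operator, merged or one by one."""
    op = HashBuildSketch()
    if merge_batches:
        # Assumed analogue of serializing all batches into a single buffer.
        merged = [row for batch in batches for row in batch]
        op.add_input(merged)
    else:
        for batch in batches:
            op.add_input(batch)
    return op


# 1000 narrow batches of 2 rows each, as with a small max batch size.
batches = [[i, i + 1] for i in range(0, 2000, 2)]
per_batch = broadcast_build(batches, merge_batches=False)
merged = broadcast_build(batches, merge_batches=True)
print(per_batch.add_input_calls, merged.add_input_calls)
```

Both paths end up with the same rows in the same order; only the call count differs, which is where the reported HashBuild time saving would come from if per-call overhead dominates.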
