vigneshsiva11 opened a new pull request, #3233:
URL: https://github.com/apache/datafusion-comet/pull/3233

   ## Which issue does this PR close?
   
   Closes #3225.
   
   ## Rationale for this change
   
   Comet already uses field-major processing for top-level struct fields, but 
it falls back to slow row-major processing (using `append_field`) when it 
encounters complex nested types such as Structs inside Structs. This fallback 
carries a significant performance penalty because type dispatch is repeated 
for every row and memory is accessed row by row.
   
   By extending the field-major optimization to nested Struct fields, we 
achieve a more "vectorized" approach that maintains cache locality and reduces 
execution overhead. This change is expected to provide a 1.2x to 1.5x speedup 
for workloads involving deeply nested data structures.
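
To make the row-major versus field-major distinction concrete, here is a minimal sketch. It is a toy model under stated assumptions: `FieldType`, `Column`, `Row`, `row_major`, and `field_major` are hypothetical names invented for illustration and do not mirror the signatures in `row.rs`. Rows are modeled as lists of 8-byte slots, and the `dispatches` counter only counts type matches; it is not a performance measurement.

```rust
// Toy model only: every name here is hypothetical and is not Comet's API.

#[derive(Clone, Copy)]
enum FieldType {
    Long,
    Double,
}

#[derive(Debug, PartialEq)]
enum Column {
    Long(Vec<i64>),
    Double(Vec<f64>),
}

// Assumption: a row is a list of 8-byte slots interpreted per the schema.
type Row = Vec<u64>;

fn empty_columns(schema: &[FieldType]) -> Vec<Column> {
    schema
        .iter()
        .map(|ty| match ty {
            FieldType::Long => Column::Long(Vec::new()),
            FieldType::Double => Column::Double(Vec::new()),
        })
        .collect()
}

/// Row-major: the type match runs once per (row, field) pair.
fn row_major(schema: &[FieldType], rows: &[Row]) -> (Vec<Column>, usize) {
    let mut cols = empty_columns(schema);
    let mut dispatches = 0;
    for row in rows {
        for (i, col) in cols.iter_mut().enumerate() {
            dispatches += 1;
            match col {
                Column::Long(v) => v.push(row[i] as i64),
                Column::Double(v) => v.push(f64::from_bits(row[i])),
            }
        }
    }
    (cols, dispatches)
}

/// Field-major: the type match runs once per field, then a tight loop
/// appends that field for every row into one contiguous buffer.
fn field_major(schema: &[FieldType], rows: &[Row]) -> (Vec<Column>, usize) {
    let mut cols = Vec::with_capacity(schema.len());
    let mut dispatches = 0;
    for (i, ty) in schema.iter().enumerate() {
        dispatches += 1;
        cols.push(match ty {
            FieldType::Long => {
                Column::Long(rows.iter().map(|r| r[i] as i64).collect())
            }
            FieldType::Double => {
                Column::Double(rows.iter().map(|r| f64::from_bits(r[i])).collect())
            }
        });
    }
    (cols, dispatches)
}
```

Both functions build identical columns. In this toy, the row-major version performs one type match per row per field, while the field-major version performs one per field regardless of row count.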
   
   ## What changes are included in this PR?
   
   This PR includes the following technical refactors in 
`native/core/src/execution/shuffle/row.rs`:
   
   * **Recursive Optimization**: Replaced the row-major `for` loop in the 
`DataType::Struct` match arm with a recursive call to `append_columns`.
   * **Validity Separation**: Implemented a single-pass extraction of the 
nested validity (null-mask) for the entire batch of rows before processing 
child fields, fulfilling the "Proposed Optimization" requirement.
   * **Field-Major Traversal**: Nested struct levels are now traversed without 
leaving the optimized field-major execution path.
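
The three points above can be sketched together as follows. This is an illustrative toy, not the code in this PR: `Schema`, `Cell`, `Column`, and this simplified `append_columns` are hypothetical stand-ins with invented signatures. The sketch assumes that a child column keeps one slot per parent row, so a null struct contributes a placeholder value (`0` here) to each child.

```rust
// Toy model only: these types and this `append_columns` are hypothetical
// and do not match the real signatures in row.rs.

#[derive(Debug, PartialEq)]
enum Schema {
    Long,
    Struct(Vec<Schema>),
}

#[derive(Debug, PartialEq)]
enum Cell {
    Long(i64),
    // `None` models a null nested struct.
    Struct(Option<Vec<Cell>>),
}

#[derive(Debug, PartialEq)]
enum Column {
    Long(Vec<i64>),
    Struct { validity: Vec<bool>, children: Vec<Column> },
}

/// Builds one column per field. `rows[r]` is `None` when the enclosing
/// struct is null for row `r`.
fn append_columns(fields: &[Schema], rows: &[Option<&[Cell]>]) -> Vec<Column> {
    let mut out = Vec::with_capacity(fields.len());
    for (i, field) in fields.iter().enumerate() {
        // The schema type is matched once per field, not once per row.
        let column = match field {
            Schema::Long => {
                let values: Vec<i64> = rows
                    .iter()
                    .copied()
                    .map(|row| match row {
                        Some(cells) => match &cells[i] {
                            Cell::Long(v) => *v,
                            Cell::Struct(_) => panic!("schema mismatch"),
                        },
                        // Placeholder slot for a null parent struct.
                        None => 0,
                    })
                    .collect();
                Column::Long(values)
            }
            Schema::Struct(child_fields) => {
                // Single pass: collect the nested rows for the whole batch.
                let nested: Vec<Option<&[Cell]>> = rows
                    .iter()
                    .copied()
                    .map(|row| match row {
                        Some(cells) => match &cells[i] {
                            Cell::Struct(Some(inner)) => Some(inner.as_slice()),
                            Cell::Struct(None) => None,
                            Cell::Long(_) => panic!("schema mismatch"),
                        },
                        None => None,
                    })
                    .collect();
                // The validity mask is derived before any child is touched.
                let validity: Vec<bool> = nested.iter().map(|r| r.is_some()).collect();
                // Recurse, staying on the field-major path at the next level.
                let children = append_columns(child_fields, &nested);
                Column::Struct { validity, children }
            }
        };
        out.push(column);
    }
    out
}
```

For a schema of `[Long, Struct([Long])]` and two rows where the second row's nested struct is null, this sketch yields a validity mask of `[true, false]` and a child column that still has two entries.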
   
   ## How are these changes tested?
   
   These changes were verified using the existing native test suite to ensure 
functional parity with the previous row-major implementation:
   
   1. **Unit Tests**: Ran `cargo test --lib execution::shuffle::row`, which 
passes the existing struct-related test cases:
       * `test_append_null_row_to_struct_builder`
       * `test_append_null_struct_field_to_struct_builder`
   2. **Compilation Check**: Ran `cargo check` and confirmed that the 
`native/core` crate builds with no errors or warnings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

