[PR] perf: Improve shuffle performance with complex types [WIP] [datafusion-comet]

via GitHub Mon, 26 Jan 2026 08:08:54 -0800


andygrove opened a new pull request, #3289:
URL: https://github.com/apache/datafusion-comet/pull/3289


   ## Summary
   
   This PR optimizes native shuffle performance for complex types (arrays and 
nested structs). These optimizations reduce type dispatch overhead and improve 
cache locality during shuffle operations.
   
   **Closes #3225**
   
   ## Changes
   
   ### 1. Array Element Iteration Optimization (commit e32dd525489)
   
   Optimizes `SparkUnsafeArray` iteration with slice-based append for primitive 
types:
   
   - **Non-nullable path**: Uses `append_slice()` for optimal memcpy-style bulk 
copy
   - **Nullable path**: Uses pointer iteration with efficient null bitset 
reading
   
   **Supported types**: i8, i16, i32, i64, f32, f64, date32, timestamp (microsecond)
   
   **Benchmark results (10K elements)**:
   
   | Type | Baseline | Optimized | Speedup |
   |------|----------|-----------|---------|
   | i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
   | i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
   | i64/no_nulls | 6.15µs | 1.22µs | **5x** |
   | i64/with_nulls | 16.41µs | 16.41µs | 1x |
   | f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
   | f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
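
   The two paths can be sketched as follows. This is a minimal, std-only illustration and not the code from `list.rs`: `I32Builder` is a hypothetical stand-in for an Arrow primitive builder, `append_array` and `is_null` are illustrative names, and the null bitset layout (one bit per element, least significant bit first, packed into 64-bit words) is an assumption made for the sketch.

   ```rust
   /// Hypothetical stand-in for an Arrow primitive builder: values plus validity.
   #[derive(Debug, Default, PartialEq)]
   pub struct I32Builder {
       pub values: Vec<i32>,
       pub validity: Vec<bool>,
   }

   impl I32Builder {
       /// Bulk append of non-null values; `extend_from_slice` is a memcpy-style copy.
       pub fn append_slice(&mut self, src: &[i32]) {
           self.values.extend_from_slice(src);
           self.validity.extend(std::iter::repeat(true).take(src.len()));
       }

       /// Append one value, or a null with a zero placeholder.
       pub fn append_option(&mut self, v: Option<i32>) {
           self.values.push(v.unwrap_or(0));
           self.validity.push(v.is_some());
       }
   }

   /// Assumed layout: bit `i` of the bitset is set when element `i` is null.
   pub fn is_null(null_words: &[u64], i: usize) -> bool {
       (null_words[i >> 6] >> (i & 63)) & 1 == 1
   }

   /// Non-nullable arrays take the bulk path; nullable arrays are walked
   /// element by element, testing the null bit for each one.
   pub fn append_array(
       builder: &mut I32Builder,
       values: &[i32],
       null_words: &[u64],
       nullable: bool,
   ) {
       if !nullable {
           builder.append_slice(values);
       } else {
           for (i, &v) in values.iter().enumerate() {
               if is_null(null_words, i) {
                   builder.append_option(None);
               } else {
                   builder.append_option(Some(v));
               }
           }
       }
   }
   ```

   In this sketch the nullable path still touches every element individually, which is consistent with the smaller speedups in the `with_nulls` rows of the table above.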
   
   ### 2. Struct Field Processing with Field-Major Order (commit 471fb2ac143)
   
   Optimizes struct field processing by using field-major instead of row-major 
order:
   
   **Previous approach**: For each row, iterate over all fields and call 
`append_field()`, which matches on the field type for every field in every row. 
With N fields and M rows, that is N×M type matches.
   
   **New approach**:
   1. First pass: Loop over rows, build struct validity
   2. Second pass: For each field, get typed builder once, then process all 
rows for that field
   
   This reduces type dispatch from O(rows × fields) to O(fields).
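
   To make the dispatch count concrete, here is a small std-only sketch that converts the same rows in both orders and counts how often the column type is matched. The `FieldType`, `Column`, and `Row` types and both function names are hypothetical and only model the idea; the struct validity pass is omitted, and each row is assumed to hold one 8-byte slot per field.

   ```rust
   #[derive(Clone, Copy)]
   pub enum FieldType {
       Int32,
       Int64,
   }

   /// Typed output columns (stand-ins for typed Arrow builders).
   #[derive(Debug, PartialEq)]
   pub enum Column {
       Int32(Vec<i32>),
       Int64(Vec<i64>),
   }

   /// Assumed row layout for this sketch: one 8-byte slot per field.
   pub type Row = Vec<i64>;

   fn new_columns(schema: &[FieldType]) -> Vec<Column> {
       schema
           .iter()
           .map(|t| match t {
               FieldType::Int32 => Column::Int32(Vec::new()),
               FieldType::Int64 => Column::Int64(Vec::new()),
           })
           .collect()
   }

   /// Row-major: the column type is matched once per (row, field) pair.
   pub fn row_major(schema: &[FieldType], rows: &[Row]) -> (Vec<Column>, usize) {
       let mut cols = new_columns(schema);
       let mut dispatches = 0;
       for row in rows {
           for (f, col) in cols.iter_mut().enumerate() {
               dispatches += 1;
               match col {
                   Column::Int32(v) => v.push(row[f] as i32),
                   Column::Int64(v) => v.push(row[f]),
               }
           }
       }
       (cols, dispatches)
   }

   /// Field-major: the column type is matched once per field, then all rows
   /// are appended to that column in a tight, type-specific loop.
   pub fn field_major(schema: &[FieldType], rows: &[Row]) -> (Vec<Column>, usize) {
       let mut cols = new_columns(schema);
       let mut dispatches = 0;
       for (f, col) in cols.iter_mut().enumerate() {
           dispatches += 1;
           match col {
               Column::Int32(v) => v.extend(rows.iter().map(|r| r[f] as i32)),
               Column::Int64(v) => v.extend(rows.iter().map(|r| r[f])),
           }
       }
       (cols, dispatches)
   }
   ```

   Both functions return identical columns; only the number of type matches differs (rows × fields versus fields).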
   
   ### 3. Nested Struct Field-Major Processing (commit f3da0dcdbe9)
   
   Extends field-major optimization to recursively handle nested Struct fields:
   
   **Previously**: Nested structs fell back to row-major processing via 
`append_field`, losing the benefit of field-major processing at each nesting 
level.
   
   **Now**:
   - Add `append_nested_struct_fields_field_major` helper function for 
recursive processing
   - For nested Struct fields: collect addresses/sizes in one pass, build 
validity, then recursively apply field-major processing
   - List and Map fields continue to fall back to `append_field` 
(variable-length elements are harder to optimize)
   
   **Expected impact**: 1.2-1.5x speedup for deeply nested struct types, with 
benefits multiplying with nesting depth.
   
   ## Files Changed
   
   ### `native/core/src/execution/shuffle/list.rs`
   - Add slice-based `append_to_builder` for primitive array types
   - Implement bulk copy via `append_slice()` for non-nullable arrays
   - Implement optimized pointer iteration for nullable arrays
   - Support for: Int8, Int16, Int32, Int64, Float32, Float64, Date32, 
TimestampMicrosecond
   
   ### `native/core/src/execution/shuffle/row.rs`
   - Add `append_struct_fields_field_major()` function for field-major struct 
processing
   - Add `append_nested_struct_fields_field_major()` helper for recursive 
nested struct handling
   - Update `append_columns()` to use field-major processing for struct columns
   - Separate `DataType::Struct` case from List/Map in field-major processing
   
   ### `native/core/benches/array_conversion.rs` (new)
   - Benchmarks for array element iteration
   - Tests various primitive types (i32, i64, f64, date32, timestamp)
   - Tests both nullable and non-nullable configurations
   - Tests different array sizes (1K, 10K elements)
   
   ### `native/core/benches/struct_conversion.rs` (new)
   - Benchmarks for struct column processing
   - Tests flat structs with varying field counts (5, 10, 20 fields)
   - Tests 2-level nested structs (`Struct<Struct<int64 fields>>`)
   - Tests 3-level nested structs (`Struct<Struct<Struct<int64 fields>>>`)
   - Tests different row counts (1K, 10K rows)
   
   ## Test Plan
   
   - [x] All existing Rust tests pass
   - [x] Benchmarks run successfully
   - [ ] JVM tests pass (TODO)
   
   ## How to Run Benchmarks
   
   ```bash
   # Array conversion benchmarks
   cargo bench --bench array_conversion
   
   # Struct conversion benchmarks
   cargo bench --bench struct_conversion
   
   # Run specific benchmark
   cargo bench --bench struct_conversion -- nested_struct_conversion
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

