andygrove opened a new issue, #3435:
URL: https://github.com/apache/datafusion-comet/issues/3435

   ## Summary
   
   After #3401 set `CometScanWrapper.isFfiSafe = true`, there is a small 
performance regression for dictionary-encoded columns (most notably SmallInt, 
which almost always stays dictionary-encoded in Parquet due to its low 
cardinality).
   
   ## Root Cause
   
   In `copy_or_unpack_array` (`native/core/src/execution/operators/copy.rs`), 
the dictionary unpacking path always performs a defensive `copy_array` after 
`cast_with_options`, regardless of the `CopyMode`:
   
   ```rust
   DataType::Dictionary(_, value_type) => {
       let options = CastOptions::default();
       // We need to copy the array after `cast` because arrow-rs `take` kernel 
which is used
       // to unpack dictionary array might reuse the input array's null buffer.
       Ok(copy_array(&cast_with_options(
           array,
           value_type.as_ref(),
           &options,
       )?))
   }
   ```
   
   In `UnpackOrClone` mode (used when `isFfiSafe=true`), the caller has already 
transferred ownership of the original array's memory to native, so null buffer 
reuse from the `take` kernel is safe. The extra deep copy is unnecessary.
   
   ## Proposed Fix
   
   Make the defensive copy conditional on the mode:
   
   ```rust
   DataType::Dictionary(_, value_type) => {
       let options = CastOptions::default();
       let unpacked = cast_with_options(array, value_type.as_ref(), &options)?;
       if mode == &CopyMode::UnpackOrDeepCopy {
           // We need to copy the array after `cast` because arrow-rs `take` 
kernel which is
           // used to unpack dictionary array might reuse the input array's 
null buffer.
           Ok(copy_array(&unpacked))
       } else {
           // In UnpackOrClone mode, the caller owns the original array's 
memory so
           // null buffer reuse from the take kernel is safe.
           Ok(unpacked)
       }
   }
   ```
   
   This eliminates the redundant deep copy of the full flat array for 
dictionary-encoded columns, restoring performance parity with the previous code 
path.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to