andygrove opened a new issue, #3435:
URL: https://github.com/apache/datafusion-comet/issues/3435
## Summary
After #3401 set `CometScanWrapper.isFfiSafe = true`, there is a small
performance regression for dictionary-encoded columns (most notably SmallInt,
which almost always stays dictionary-encoded in Parquet due to its low
cardinality).
## Root Cause
In `copy_or_unpack_array` (`native/core/src/execution/operators/copy.rs`),
the dictionary unpacking path always performs a defensive `copy_array` after
`cast_with_options`, regardless of the `CopyMode`:
```rust
DataType::Dictionary(_, value_type) => {
let options = CastOptions::default();
// We need to copy the array after `cast` because arrow-rs `take` kernel
which is used
// to unpack dictionary array might reuse the input array's null buffer.
Ok(copy_array(&cast_with_options(
array,
value_type.as_ref(),
&options,
)?))
}
```
In `UnpackOrClone` mode (used when `isFfiSafe=true`), the caller has already
transferred ownership of the original array's memory to native, so null buffer
reuse from the `take` kernel is safe. The extra deep copy is unnecessary.
## Proposed Fix
Make the defensive copy conditional on the mode:
```rust
DataType::Dictionary(_, value_type) => {
let options = CastOptions::default();
let unpacked = cast_with_options(array, value_type.as_ref(), &options)?;
if mode == &CopyMode::UnpackOrDeepCopy {
// We need to copy the array after `cast` because arrow-rs `take`
kernel which is
// used to unpack dictionary array might reuse the input array's
null buffer.
Ok(copy_array(&unpacked))
} else {
// In UnpackOrClone mode, the caller owns the original array's
memory so
// null buffer reuse from the take kernel is safe.
Ok(unpacked)
}
}
```
This eliminates the redundant deep copy of the full flat array for
dictionary-encoded columns, restoring performance parity with the previous code
path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]