harshitsaini17 opened a new pull request, #19195:
URL: https://github.com/apache/datafusion/pull/19195
## Which issue does this PR close?
Closes #19146
Part of #19144 (EPIC: fix nullability report for spark expression)
## Rationale for this change
The `bitmap_count` UDF was using the default `return_type` implementation
which does not preserve nullability information. This causes:
1. **Incorrect schema inference** - non-nullable binary inputs are
incorrectly marked as producing nullable Int64 outputs
2. **Missed optimization opportunities** - the query optimizer cannot apply
nullability-based optimizations when metadata is incorrect
3. **Inconsistent behavior** - other similar functions preserve nullability,
leading to unexpected differences
The `bitmap_count` function counts set bits in a binary input and returns an
Int64. The operation itself doesn't introduce nullability - if the input is
non-nullable, the output will always be non-nullable. Therefore, the output
nullability should match the input.
## What changes are included in this PR?
1. **Implemented `return_field_from_args`**: Creates a field with Int64 type
and the same nullability as the input field
2. **Updated `return_type`**: Now returns an error directing users to use
`return_field_from_args` instead (following DataFusion best practices)
3. **Added `FieldRef` import** to support returning field metadata
4. **Added nullability test**: Verifies that both nullable and non-nullable
inputs are handled correctly
## Are these changes tested?
Yes, this PR includes a new test `test_bitmap_count_nullability` that
verifies:
- Non-nullable Binary input produces non-nullable Int64 output
- Nullable Binary input produces nullable Int64 output
- Data type is correctly set to Int64 in both cases
Test results:
```
running 1 test
test function::bitmap::bitmap_count::tests::test_bitmap_count_nullability
... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
```
Additionally, all existing `bitmap_count` tests continue to pass, ensuring
backward compatibility.
## Are there any user-facing changes?
**Yes - Schema metadata improvement:**
Users will now see correct nullability information in the schema:
**Before (Bug):**
- Non-nullable binary input → nullable Int64 output (incorrect)
**After (Fixed):**
- Non-nullable binary input → non-nullable Int64 output (correct)
- Nullable binary input → nullable Int64 output (correct)
This is a **bug fix** that corrects schema metadata only - it does not
change the actual computation or introduce any breaking changes to the API.
**Impact:**
- Query optimizers can now make better decisions based on accurate
nullability information
- Schema validation will be more accurate
- No changes to function behavior or output values
---
## Code Changes Summary
### Modified File: `datafusion/spark/src/function/bitmap/bitmap_count.rs`
#### 1. Added Import
```rust
use arrow::datatypes::{DataType, FieldRef, Int16Type, Int32Type, Int64Type,
Int8Type};
```
#### 2. Updated return_type Method
```rust
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
internal_err!("return_field_from_args should be used instead")
}
```
#### 3. Added return_field_from_args Implementation
```rust
fn return_field_from_args(
&self,
args: datafusion_expr::ReturnFieldArgs,
) -> Result<FieldRef> {
use arrow::datatypes::Field;
// bitmap_count returns Int64 with the same nullability as the input
Ok(Arc::new(Field::new(
args.arg_fields[0].name(),
DataType::Int64,
args.arg_fields[0].is_nullable(),
)))
}
```
#### 4. Added Test
```rust
#[test]
fn test_bitmap_count_nullability() -> Result<()> {
use datafusion_expr::ReturnFieldArgs;
let bitmap_count = BitmapCount::new();
// Test with non-nullable binary field
let non_nullable_field = Arc::new(Field::new("bin", DataType::Binary,
false));
let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
arg_fields: &[Arc::clone(&non_nullable_field)],
scalar_arguments: &[None],
})?;
// The result should not be nullable (same as input)
assert!(!result.is_nullable());
assert_eq!(result.data_type(), &Int64);
// Test with nullable binary field
let nullable_field = Arc::new(Field::new("bin", DataType::Binary, true));
let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
arg_fields: &[Arc::clone(&nullable_field)],
scalar_arguments: &[None],
})?;
// The result should be nullable (same as input)
assert!(result.is_nullable());
assert_eq!(result.data_type(), &Int64);
Ok(())
}
```
---
## Verification Steps
1. **Run the new test:**
```bash
cargo test -p datafusion-spark test_bitmap_count_nullability --lib
```
2. **Run all bitmap_count tests:**
```bash
cargo test -p datafusion-spark bitmap_count --lib
```
3. **Run clippy checks:**
```bash
cargo clippy -p datafusion-spark --all-targets -- -D warnings
```
All checks pass successfully!
---
## Related Issues
- Closes: #19146
- Part of EPIC: #19144 (fix nullability report for spark expression)
- Similar fix: #19145 (shuffle function nullability)
---
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]