andygrove opened a new pull request, #64:
URL: https://github.com/apache/datafusion-java/pull/64
## Which issue does this PR close?
Closes #62.
## Rationale for this change
DataFusion's Rust `ScalarUDFImpl::invoke_with_args` speaks `ColumnarValue`
(`Array` or `Scalar`) rather than raw Arrow arrays. The Java binding previously
materialised every scalar arg to a length-N array before crossing the JNI
boundary, which lost the scalar-vs-array distinction and forced nullary UDFs to
learn the batch row count by some out-of-band channel (the workaround proposed
in PR #57).
Aligning the Java API with the Rust enum eliminates the workaround: a
nullary UDF can return `ColumnarValue.scalar(...)` and the framework broadcasts
it, and a UDF that takes literals sees them as Scalars without per-row
duplication.
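
The broadcast behaviour described above can be pictured with a minimal, self-contained sketch. It does not use the real binding: `Value`, `BroadcastSketch`, `javaPi`, and `toColumn` are hypothetical stand-ins (plain Java lists instead of Arrow vectors), and the expansion shown is only a conceptual model of what the framework does with a scalar result.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the two result variants; the real API wraps Arrow vectors.
sealed interface Value {
    record Array(List<Double> values) implements Value {}
    record Scalar(double value) implements Value {}
}

final class BroadcastSketch {
    // A nullary UDF: it takes no arguments and never needs the batch row count.
    static Value javaPi() {
        return new Value.Scalar(Math.PI);
    }

    // Conceptual model of the framework side: expand a scalar to rowCount rows,
    // pass an array through after checking its length.
    static List<Double> toColumn(Value result, int rowCount) {
        if (result instanceof Value.Scalar s) {
            return Collections.nCopies(rowCount, s.value());
        }
        Value.Array a = (Value.Array) result;
        if (a.values().size() != rowCount) {
            throw new IllegalStateException("array result must have " + rowCount + " rows");
        }
        return a.values();
    }
}
```

The point of the sketch is that the UDF body stays independent of the batch size; only the caller, which already knows the row count, performs the expansion.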
## What changes are included in this PR?
- New `ColumnarValue` sealed interface (`Array`/`Scalar` records, factory
enforcing length-1 invariant on scalars).
- New `ScalarFunctionArgs` record bundling `List<ColumnarValue>` and
`rowCount`.
- `ScalarFunction.evaluate` is now `evaluate(BufferAllocator,
ScalarFunctionArgs) -> ColumnarValue` (source-breaking).
- `JniBridge.invokeScalarUdf` rewritten to ship two struct arrays (length-N
Array args + length-1 Scalar args) plus a `byte[] argKinds` positional mask,
returning a `byte` indicating the result variant. JNI signature is now
`(Lorg/apache/datafusion/ScalarFunction;JJJJ[BJJI)B`.
- Native `invoke_with_args` no longer materialises scalars; it partitions
args by `ColumnarValue` variant and reconstructs the result from the returned
kind byte via `ScalarValue::try_from_array`.
- `AddOneExample` and `docs/source/user-guide/scalar-udf.md` updated; new
"Returning a Scalar" section added to the user guide.
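
As a rough illustration of the first and fourth bullets, the sketch below shows one way a sealed `Array`/`Scalar` type with a length-1 factory check could look, and how a positional kind mask could be used to reassemble arguments from two separate groups. Everything here is an assumption made for illustration: `Vec` stands in for an Arrow vector, `ColumnarValueSketch` and `ArgKinds` are hypothetical names, and the byte values `0`/`1` are an invented encoding, not the one used by `JniBridge`.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an Arrow vector (assumption: the real API wraps Arrow vectors).
record Vec(List<Object> values) {
    int length() { return values.size(); }
}

// Hypothetical mirror of the Array/Scalar split described in the PR.
sealed interface ColumnarValueSketch {
    Vec vector();

    record Array(Vec vector) implements ColumnarValueSketch {}
    record Scalar(Vec vector) implements ColumnarValueSketch {}

    static ColumnarValueSketch array(Vec v) {
        return new Array(v);
    }

    // Factory enforcing the length-1 invariant on scalars.
    static ColumnarValueSketch scalar(Vec v) {
        if (v.length() != 1) {
            throw new IllegalArgumentException(
                "scalar must wrap a length-1 vector, got " + v.length());
        }
        return new Scalar(v);
    }
}

final class ArgKinds {
    // Invented encoding for this sketch only.
    static final byte ARRAY = 0;
    static final byte SCALAR = 1;

    // Rebuild the positional argument list from the two groups using the mask:
    // each mask entry says which group the next argument comes from.
    static List<ColumnarValueSketch> reassemble(
            List<Vec> arrays, List<Vec> scalars, byte[] kinds) {
        List<ColumnarValueSketch> out = new ArrayList<>(kinds.length);
        int a = 0;
        int s = 0;
        for (byte k : kinds) {
            out.add(k == SCALAR
                ? ColumnarValueSketch.scalar(scalars.get(s++))
                : ColumnarValueSketch.array(arrays.get(a++)));
        }
        return out;
    }
}
```

For a call such as `f(10, col)`, the mask would be `{SCALAR, ARRAY}`: the literal is taken from the scalar group and the column from the array group, so the original argument order is preserved without duplicating the literal per row.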
## How are these changes tested?
`make test` — 135 tests pass (12 pre-existing skips). Existing
`ScalarUdfTest` cases (`AddOne`, `Concat`, `Square`, error paths, volatility
round-trip) adapted to the new signature, plus three new tests:
- `nullaryScalarReturnUdf_overMultiRowQuery_broadcasts` — a nullary
`java_pi` returns `ColumnarValue.scalar(...)` and the framework expands it
across rows, replacing the rowCount workaround.
- `scalarLiteralArg_arrivesAsScalarColumnarValue` — UDF asserts that a SQL
literal arrives as `ColumnarValue.Scalar` (length 1), proving scalar-ness
survives the FFI.
- `udfReturningScalar_isBroadcastByFramework` — explicit scalar-return path
test.
Also covered by `cargo clippy --all-targets --workspace -- -D warnings`
(clean) and `./mvnw spotless:check` (clean).
## Are there any user-facing changes?
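Yes. This is a source-breaking signature change to `ScalarFunction.evaluate`.
Implementations must accept a `ScalarFunctionArgs` (reading arguments via
`args.args()`) and wrap their result in a `ColumnarValue`.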
Before:
```java
public FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args) {
    IntVector in = (IntVector) args.get(0);
    // ...
    return out;
}
```
After:
```java
public ColumnarValue evaluate(BufferAllocator allocator, ScalarFunctionArgs args) {
    IntVector in = (IntVector) args.args().get(0).vector();
    // ...
    return ColumnarValue.array(out);
}
```
Nullary or broadcast-style UDFs can return `ColumnarValue.scalar(...)` over
a length-1 vector.