LantaoJin opened a new issue, #58:
URL: https://github.com/apache/datafusion-java/issues/58

   ### Describe the bug
   
   `ScalarFunction.argTypes()` returns `List<ArrowType>` and `returnType()` 
returns `ArrowType` 
(`core/src/main/java/org/apache/datafusion/ScalarFunction.java:47, :50`). Java 
Arrow's `ArrowType` is a *leaf marker* for the type kind: for primitives like 
`Int32` or `Float64` it is self-describing, but for nested types (`List`, 
`Struct`, `Map`, `FixedSizeList`) the element / member / key / value types live 
on the parent `Field`'s `children` list, not inside `ArrowType` itself. 
`ArrowType.List` is literally a no-field marker class.
   
   Because the UDF signature is expressed only in terms of `ArrowType`, a Java 
UDF author has no way to declare a typed nested signature. The closest they can 
write is:
   
   ```java
   public List<ArrowType> argTypes() {
     return List.of(new ArrowType.List());  // says "list" -- cannot say "of Int32"
   }
   ```
   
   When this is passed to `SessionContext.registerUdf(ScalarUdf)` the 
registration path at 
`core/src/main/java/org/apache/datafusion/SessionContext.java:385-389` 
constructs the signature schema as:
   
   ```java
   fields.add(new Field("return", FieldType.nullable(returnType), null));
   for (int i = 0; i < argTypes.size(); i++) {
     fields.add(new Field("arg" + i, FieldType.nullable(argTypes.get(i)), null));
   }
   ```
   
   The `null` children list is the bug: Arrow's IPC writer rejects the 
malformed `List` field during `serializeSchemaIpc(...)` before the schema ever 
crosses JNI. The user sees a low-level `IllegalArgumentException: Lists have 
one child Field. Found: none`.
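
The failure mode can be reproduced in miniature without Arrow on the classpath. The sketch below is a stand-in model, not Arrow's or DataFusion-Java's code: `Kind`, `FieldSketch`, `fromLeafMarker`, `validate`, and `validationError` are hypothetical names, and the check only imitates the "one child" rule described above. It shows that a field built from a kind-only marker with a `null` children list cannot pass that rule, while the same field with an explicit child does.

```java
import java.util.List;

/** Stand-in model (not Arrow's classes) of a kind-only marker versus a field with children. */
public class NestedSignatureSketch {

  /** Hypothetical stand-in for ArrowType: names the kind only, never the element type. */
  enum Kind { INT32, LIST }

  /** Hypothetical stand-in for Arrow's Field: a kind plus child fields. */
  static final class FieldSketch {
    final String name;
    final Kind kind;
    final List<FieldSketch> children;

    FieldSketch(String name, Kind kind, List<FieldSketch> children) {
      this.name = name;
      this.kind = kind;
      // Treat a null children list as empty.
      this.children = (children == null) ? List.<FieldSketch>of() : children;
    }
  }

  /** Imitates the rule described above: a list field must have exactly one child. */
  static void validate(FieldSketch field) {
    if (field.kind == Kind.LIST && field.children.size() != 1) {
      String found = field.children.isEmpty() ? "none" : String.valueOf(field.children.size());
      throw new IllegalArgumentException("Lists have one child Field. Found: " + found);
    }
    for (FieldSketch child : field.children) {
      validate(child);
    }
  }

  /** Models the registration path: only the kind is known, so children are null. */
  static FieldSketch fromLeafMarker(int index, Kind kind) {
    return new FieldSketch("arg" + index, kind, null);
  }

  /** Returns the validation error message, or null when the field is well-formed. */
  static String validationError(FieldSketch field) {
    try {
      validate(field);
      return null;
    } catch (IllegalArgumentException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    FieldSketch bare = fromLeafMarker(0, Kind.LIST);
    FieldSketch typed =
        new FieldSketch("arg0", Kind.LIST, List.of(new FieldSketch("item", Kind.INT32, null)));
    System.out.println("bare:  " + validationError(bare));
    System.out.println("typed: " + validationError(typed));
  }
}
```

In this model, no wrapper around the marker can make the first field valid, because the element type was never supplied by the caller.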
   
   This blocks the entire family of nested-type UDFs that exist as built-ins in 
DataFusion's `datafusion-functions-nested` crate (`array_length`, 
`cardinality`, `array_has`, `array_position`, `flatten`, `map_keys`, 
`map_values`, `arrays_zip`, ...). Anyone porting Spark UDFs over `ArrayType` / 
`StructType` / `MapType` columns to DataFusion-Java hits this on the first 
attempt.
   
   The Rust API does not have this problem: `DataType::List(Arc<Field>)` 
carries the child field inline, so 
`Signature::exact(vec![DataType::List(Arc::new(Field::new("item", 
DataType::Int32, true)))], ...)` round-trips with full structure.
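
For comparison, Java Arrow can express the same structure, but only at the `Field` level. The following is a minimal sketch, assuming `arrow-vector` is on the classpath. The class and method names (`TypedListField`, `listOfInt32`) are illustrative and not part of DataFusion-Java, and this issue does not prescribe them as the fix.

```java
import java.util.List;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

/** Illustrative only: builds a fully typed List<Int32> field with Java Arrow. */
public class TypedListField {

  /** Returns a nullable list field whose single child is a nullable signed 32-bit int. */
  static Field listOfInt32(String name) {
    // The element type lives on the child Field, not on ArrowType.List itself.
    Field item = new Field("item", FieldType.nullable(new ArrowType.Int(32, true)), null);
    return new Field(name, FieldType.nullable(ArrowType.List.INSTANCE), List.of(item));
  }

  public static void main(String[] args) {
    Field arg0 = listOfInt32("arg0");
    // The parent reports the kind; the child reports the element type.
    System.out.println(arg0.getType());
    System.out.println(arg0.getChildren().get(0).getType());
  }
}
```

A signature API that accepted `Field` values like this one, rather than bare `ArrowType` values, would be one way to carry the element type through to Rust. Other designs are possible.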
   
   ### To Reproduce
   
   ```java
   static final class ListLength implements ScalarFunction {
     public String name() { return "java_list_length"; }
     public List<ArrowType> argTypes() { return List.of(new ArrowType.List()); }
     public ArrowType returnType() { return new ArrowType.Int(32, true); }
     public Volatility volatility() { return Volatility.IMMUTABLE; }
     public FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args, int rowCount) {
       /* ... */
     }
   }
   
   new SessionContext().registerUdf(new ScalarUdf(new ListLength()));
   // throws:
   //   IllegalArgumentException: Lists have one child Field. Found: none
   //   at SessionContext.serializeSchemaIpc(SessionContext.java:398)
   //   at SessionContext.registerUdf(SessionContext.java:391)
   ```
   
   ### Expected behavior
   
   A UDF whose argument or return type is a nested Arrow type registers 
successfully and is callable from SQL with full element-type information 
preserved end-to-end (Java → JNI → Rust `Signature::exact`).
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

