Re: [I] Deduplicate Spark function code with native/default datafusion function code [datafusion]

via GitHub Sun, 02 Nov 2025 23:21:46 -0800


Jefffrey commented on issue #17964:
URL: https://github.com/apache/datafusion/issues/17964#issuecomment-3479215281


   > For `rint -> round`, from what I understand, `rint` uses float 
[round_ties_even](https://doc.rust-lang.org/std/primitive.f32.html#method.round_ties_even)
 while `round` uses float 
[round](https://doc.rust-lang.org/std/primitive.f32.html#method.round), which 
would produce difference results for floats half-way between integers. Is that 
intended, or what was the reason?
   
   This is indeed the intention of the ticket; to find out these differences 
and ensure they are either eliminated if there is no reason for them to be 
different, or to document these differences so in future we know that they are 
meant to be different. In this case it's worth going through the commit/PR 
history to see if there are reasons given, otherwise falling back to testing 
against Spark to confirm if these are meant to be different.
   
   > I did not seem to find unit tests in 
`datafusion/functions-nested/src/make_array.rs` or 
`datafusion/spark/src/functions/spark_array.rs`. Are tests kept in other files?
   
   We prefer keeping our tests as sqllogictests, see: 
https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
   
   > For `coalesce_types`, `make_array` uses `type_union_resolution` while 
`spark_array` uses `comparison_coersion` (with folding), both of which are 
implemented in `datafusion_expr_common::type_coercion::binary`. It seems to me 
that they are very similar for the purpose of coalescing types and `make_array` 
can also use `comparison_coersion`, but wanted to check in case I misunderstood 
anything. Also are there unit tests for `type_union_resolution`? did not seem 
to find under `type_coercion::binary` module.
   
   Unfortunately I'm not sure of this myself.
   
   > `make_array` uses field name `LIST_FIELD_DEFAULT_NAME` (which is "item") 
while `spark_array` uses field name `ARRAY_FIELD_DEFAULT_NAME` (which is 
"element"). Are they intended to be different?
   
   I'm inclined to believe they are meant to be different, but some 
investigation & accompanying documentation would be very helpful for us in the 
future to confirm the fact.
   
   > `make_array` does not handle case for `DataType::LargeList` while 
`spark_array` does. Is there a reason for that?
   
   I actually think you may have spotted a bug here 🤔 
   
   The `spark_array` implementation does seem to suggest it _may_ return a 
`LargeList` for certain inputs, but this conflicts with the stated return type 
which should always return a `List` (not a `LargeList`):
   
   
https://github.com/apache/datafusion/blob/9238779f45418e07108079aeef3b51de85e9fb8f/datafusion/spark/src/function/array/spark_array.rs#L75-L108
   
   - Side note: it's also confusing how both `return_type` and 
`return_field_from_args` are implemented since usually it's one or the other, 
so could do with some refactoring here
   
   Might be worth mocking an example test which could exhibit this bug (where 
it returns a `LargeList` in spite of it stating its return type as regular 
`List`)
   
   See the tests for the spark version of array function here: 
https://github.com/apache/datafusion/blob/9238779f45418e07108079aeef3b51de85e9fb8f/datafusion/sqllogictest/test_files/spark/array/array.slt
   
   And use this as reference for how we can construct a `LargeList` via SLT 
tests:
   
   
https://github.com/apache/datafusion/blob/9238779f45418e07108079aeef3b51de85e9fb8f/datafusion/sqllogictest/test_files/array.slt#L7946-L7949
   
   @jizezhang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Deduplicate Spark function code with native/default datafusion function code [datafusion]

Reply via email to