Chen-Yuan-Lai commented on issue #14432:
URL: https://github.com/apache/datafusion/issues/14432#issuecomment-2818684471

   Hi @eliaperantoni, I spent some time investigating additional locations in 
the codebase where argument numbers for most SQL functions are checked. I've 
roughly classified them into four categories:
   
   1.  `data_types_with_scalar_udf`, `data_types_with_aggregate_udf`, 
`data_types_with_window_udf`, `data_types` :
        *These provide the outer layer for checking zero arguments
        
https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L52-L61
   
   2. `get_valid_types`
       * Called by the functions mentioned above, with two approaches to check 
argument numbers in match branches:
           1.   Checking separately  (`TypeSignature::Uniform`, 
`TypeSignature::VariadicAny`, `TypeSignature::Nullary`,  `TypeSignature::Any`)
   
https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L642-L643
   
https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L657-L661
   
https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L688-L693
   
https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L697-L708
           2. Using the `function_length_check()` inner function  
(`TypeSignature::String`, `TypeSignature::Numeric`, 
`TypeSignature::Comparable`, `TypeSignature::Coercible`)
   
   3. `try_coerce_types`
       * Also called by the above functions, but after `valid_types()`. 
Signature variants that don't check argument numbers in the match branch in 
`valid_types()` will return errors here (e.g., `TypeSignature::Exact`, 
`TypeSignature::OneOf`).
   
   4. Implemented methods of `ScalarUDFImpl` and `AggregateUDFImpl trait`: 
       *In these methods, when `take_function_args` is called, it also checks 
the argument count automatically. However, if `take_function_args` is not 
called, they may implement the argument count checking themself
           * `return_type`
           * `return_type_from_args`
           *  `invoke_with_args`
           *  `simplify`
           *  `coerce_types`
           *  ...
   
   Based on these findings, I've discovered that the same argument checking 
logic is redundantly implemented in many places. Furthermore, some of these 
checks might never be triggered. For example, `pi()` has 
`TypeSignature::Nullary` and checks the argument number in both categories 1 
and 2, but category 1 would return a zero argument error before category 2 is 
ever reached.
   
   Therefore, I propose creating a new function to integrate checking logic for 
all signature variants in a single place. This approach offers several benefits:
   
   This function would be used in `data_types_with_scalar_udf`, 
`data_types_with_aggregate_udf`, `data_types_with_window_udf`, and 
`data_types`, enabling early argument count checking.
   We can attach diagnostic information in a single location.
   For signature variants that don't check argument counts and only rely on 
`try_coerce_types` for validation, this function can add checking logic to 
return appropriate errors.
   
   Regarding your proposal:
   > The difficulty lies in `Signature::OneOf` I think. For example `log` 
accepts `[Float]` and `[Float, Float]`. So the error Datafusion returns today 
for `[Int]` is simply that it didn't match any of the possibilities. While 
that's correct, perhaps it's not the best UX. Maybe that error message should 
be a fallback, more than anything.
   > 
   > Proposal:
   > 
   > * If the actual arguments don't match the length of any possibility (e.g. 
`log(1,2,3)`, then we error out with "log cannot take 3 arguments".
   > * If the actual arguments do match in length one of the possibility, but 
the types are misaligned (e.g. `log('foo')`) then we error out with "log cannot 
take argument types [utf8]"
   
   I believe this new function can also handle the case of `Signature::OneOf` 
to check argument count without knowing all the possibilities. For type 
mismatch errors, these might be better implemented in `try_coerce_types`.
   
   I've nearly completed this function in my draft PR, but I'd like your 
feedback on whether this approach makes sense. I'm happy to discuss more 
details or explore alternatives. Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to