Chen-Yuan-Lai commented on issue #14432: URL: https://github.com/apache/datafusion/issues/14432#issuecomment-2818684471
Hi @eliaperantoni, I spent some time investigating additional locations in the codebase where argument numbers for most SQL functions are checked. I've roughly classified them into four categories: 1. `data_types_with_scalar_udf`, `data_types_with_aggregate_udf`, `data_types_with_window_udf`, `data_types` : *These provide the outer layer for checking zero arguments https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L52-L61 2. `get_valid_types` * Called by the functions mentioned above, with two approaches to check argument numbers in match branches: 1. Checking separately (`TypeSignature::Uniform`, `TypeSignature::VariadicAny`, `TypeSignature::Nullary`, `TypeSignature::Any`) https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L642-L643 https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L657-L661 https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L688-L693 https://github.com/apache/datafusion/blob/f07fb109142a9174365bc0e886c63d2103ef7e4a/datafusion/expr/src/type_coercion/functions.rs#L697-L708 2. Using the `function_length_check()` inner function (`TypeSignature::String`, `TypeSignature::Numeric`, `TypeSignature::Comparable`, `TypeSignature::Coercible`) 3. `try_coerce_types` * Also called by the above functions, but after `valid_types()`. Signature variants that don't check argument numbers in the match branch in `valid_types()` will return errors here (e.g., `TypeSignature::Exact`, `TypeSignature::OneOf`). 4. Implemented methods of `ScalarUDFImpl` and `AggregateUDFImpl trait`: *In these methods, when `take_function_args` is called, it also checks the argument count automatically. However, if `take_function_args` is not called, they may implement the argument count checking themself * `return_type` * `return_type_from_args` * `invoke_with_args` * `simplify` * `coerce_types` * ... Based on these findings, I've discovered that the same argument checking logic is redundantly implemented in many places. Furthermore, some of these checks might never be triggered. For example, `pi()` has `TypeSignature::Nullary` and checks the argument number in both categories 1 and 2, but category 1 would return a zero argument error before category 2 is ever reached. Therefore, I propose creating a new function to integrate checking logic for all signature variants in a single place. This approach offers several benefits: This function would be used in `data_types_with_scalar_udf`, `data_types_with_aggregate_udf`, `data_types_with_window_udf`, and `data_types`, enabling early argument count checking. We can attach diagnostic information in a single location. For signature variants that don't check argument counts and only rely on `try_coerce_types` for validation, this function can add checking logic to return appropriate errors. Regarding your proposal: > The difficulty lies in `Signature::OneOf` I think. For example `log` accepts `[Float]` and `[Float, Float]`. So the error Datafusion returns today for `[Int]` is simply that it didn't match any of the possibilities. While that's correct, perhaps it's not the best UX. Maybe that error message should be a fallback, more than anything. > > Proposal: > > * If the actual arguments don't match the length of any possibility (e.g. `log(1,2,3)`, then we error out with "log cannot take 3 arguments". > * If the actual arguments do match in length one of the possibility, but the types are misaligned (e.g. `log('foo')`) then we error out with "log cannot take argument types [utf8]" I believe this new function can also handle the case of `Signature::OneOf` to check argument count without knowing all the possibilities. For type mismatch errors, these might be better implemented in `try_coerce_types`. I've nearly completed this function in my draft PR, but I'd like your feedback on whether this approach makes sense. I'm happy to discuss more details or explore alternatives. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org