Re: [PR] Deprecate ScalarUDFImpl::return_type [datafusion]

via GitHub Tue, 17 Dec 2024 03:15:14 -0800


findepi commented on code in PR #13717:
URL: https://github.com/apache/datafusion/pull/13717#discussion_r1888341956



##########
datafusion/core/src/catalog_common/information_schema.rs:
##########
@@ -406,6 +406,7 @@ fn get_udf_args_and_return_types(
             .into_iter()
             .map(|arg_types| {
                 // only handle the function which implemented 
[`ScalarUDFImpl::return_type`] method
+                #[allow(deprecated)]

Review Comment:
   > ```rust
   > struct ReturnTypeArgs {
   >        // Argments, if available. This may not be available in some 
contexts
   >        // such as information_table schema queries
   >         args: Option<&[Expr]>
   >        // schema, if available. 
   >         schema: Option<&dyn ExprSchema>,
   >         arg_types: &[DataType],
   > }
   > ```
   
   
   
   the seemingly single method `udf.return_type_with_args`  will be invokable 
with `ReturnTypeArgs` argument, which may or may not carry expressions. If the 
answer depends on the expression presence, it can depend in two ways
   
   1. type is known when expressions are known (and are something particular), 
type is unknown otherwise
   2. type is A when expressions are known (and are something particular), type 
is B (!= A) otherwise
   
   
   The second result is clearly not acceptable from planning perspective 
(https://github.com/apache/datafusion/pull/13717#discussion_r1879725758). 
Retrieving a false type B is misleading. Type B cannot be used if the truth is 
type A. Which leads us to the conclusion that the first option is the only 
acceptable option.
   
   For the first option, we realize that for some functions -- like for some 
syntaxes -- type is known when inputs are known. This is normal in query 
engines. This is normally handled by analyzer piece of a query engine, nothing 
special about it. This means that from query planning perspective, "what is 
this UDF return type" is a perfectly valid question with a perfectly well 
defined answer, that analyzer needs to know, should capture and persist. Once 
answered, there is no reason to ask this question ever again.
   
   The remaining use-case seems to be `information_schema.routines` 
perspective. There, the return type is unknown (or is a set of possible types), 
because the callsite is undefined. 
   
   I am therefore not convinced the introduction of `return_type_with_args` is 
a good approach. It just hides the problem, without addressing it. It's better 
if the problem is clearly visible. IMO the way to address the problem is what 
outlined above -- analyzer needs to know the type, so it should capture and 
persist that information -- https://github.com/apache/datafusion/issues/12604.
   
   I believe we do not need to spend time on `return_type_with_args`, since it 
does not solve the crux of the problem.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Deprecate ScalarUDFImpl::return_type [datafusion]

Reply via email to