Omega359 opened a new issue, #10744:
URL: https://github.com/apache/datafusion/issues/10744

   ### Is your feature request related to a problem or challenge?
   
   During work on adding a 'safe' mode to to_timestamp and to_date functions 
I've come across an issue that I would like feedback on before proceeding.
   
   ### The feature
   Currently for timestamp and date parsing if a source string cannot be parsed 
using any of the provided chrono formats datafusion will return an error. This 
is normal for a database-type solution however it is not ideal for a system 
that is parsing billions of dates in batches - some of which are human entered. 
Systems such as Spark default to a null value for anything that cannot be 
parsed and this feature enables a mode ('safe' to mirror the name and same 
behaviour as CastOptions) that allows the to_timestamp* and to_date UDFs to 
have the same behaviour (return null on error). 
   
   ### The problem
   Since UDF functions have no context provided and there isn't a way I know of 
to statically get access to config to add the avove mentioned functionality I 
resorted to using a new constructor function to allow the UDF to switch 
behaviour:
   
   ```
   #[derive(Debug)]
   pub struct ToTimestampFunc {
       signature: Signature,
       /// how to handle cast or parsing failures, either return NULL 
(safe=true) or return ERR (safe=false)
       pub safe: bool,
   }
   
   impl ToTimestampFunc {
       pub fn new() -> Self {
           Self {
               signature: Signature::variadic_any(Volatility::Immutable),
               safe: false,
           }
       }
   
       pub fn new_with_safe(safe: bool) -> Self {
           Self {
               signature: Signature::variadic_any(Volatility::Immutable),
               safe,
           }
       }
   }
   ```
   
   To use the alternative 'safe' mode for these functions is as simple as 
   ```
   let to_timestamp = ToTimestampFunc::new_with_safe(true);
   session_context.register_udf(ScalarUDF::new_from_impl(to_timestamp));
   ```
   Unfortunately this only affect sql queries - any calls to the 
[to_timestamp(args: 
Vec<Expr>)](https://github.com/apache/datafusion/blob/09dde27be39ad054f85dfb5c37b7468a3f68d652/datafusion/functions/src/datetime/mod.rs#L257)
 function will not use the new definition as registered in the function 
registry. This is because that function and every other function like it use a 
static singleton instance that only uses a ::new() call to initial it and there 
is no way that I can see to replace that instance.
   
   
   
   ### Describe the solution you'd like
   
   I see a few possible solutions to this:
   
   - Acknowledge the difference in behaviour in documentation and recommend 
using `ctx.udf("to_timestamp").unwrap().call(args)` instead of the 
to_timestamp() function anytime 'safe' mode is required. This is less than 
ideal imho as it can lead to confusion and unintuitive behavior.
   - Provide a mechanism (function argument to invoke perhaps) to provide a 
SessionContext to all UDF's so that they can adjust behaviour based on 
configuration at call time. 
   - Update the to_timestamp() function and all similar functions to have 
alternatives that do not use the statically defined version of the UDF but 
rather create it on-demand or use the definition saved in the session state. As 
examples:
   ```
       // the existing function    
       pub fn to_timestamp(args: Vec<Expr>) -> Expr {
           super::to_timestamp().call(args)
       }
   
       pub fn to_timestamp_safe(args: Vec<Expr>) -> Expr {
           
ScalarUDF::new_from_impl(ToTimestampFunc::new_with_safe(true)).call(args)
       }
   
       // or
       pub fn to_timestamp_from_context(ctx &Arc<SessionContext>, args: 
Vec<Expr>) -> Expr {
           ctx.udf("to_timestamp").unwrap().call(args)
       }
   
      // varargs would have been a really nice solution as well but 🤷 
   ```
   - something else I haven't thought of.
   
   
   Any opinions, suggestions and critiques would be welcome.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to