Omega359 opened a new issue, #13552:
URL: https://github.com/apache/datafusion/issues/13552

   ### Is your feature request related to a problem or challenge?
   
   One of the things I've been thinking about while working on Utf8View 
support in UDFs is what exactly DataFusion should support in terms of function 
signature types. We haven't yet formalized what we expect functions to 
support, so string functions are inconsistent in what they accept and what 
they produce. 
   
   @alamb also asked whether the level of specialization of a function was 
indeed required in 
https://github.com/apache/datafusion/pull/13403#issuecomment-2491701015 and if 
a proposal to have guidelines for string functions should be made. This is my 
attempt at such a proposal.
   
   ### Describe the solution you'd like
   
   In the context of this proposal, string functions are UDFs that accept and 
produce strings. This means exclusively the UDFs in `functions/string` and 
`functions/unicode`.
   
   I would like to propose the following for DataFusion:
   
   1. String functions **MUST** accept both scalar and array values for all 
data arguments (as opposed to config arguments, such as a regex function's 
'flags' argument).
   2. String functions **MUST** accept scalar values for all config arguments 
but **MAY** accept both scalar and array values if appropriate for the function.
   3. String functions **MUST** accept all valid string types for all data 
arguments. To ease implementation, the type for all data arguments **SHOULD** 
be coerced to be the largest type among all the data arguments. 
   4. String functions **MAY** choose to allow non-contiguous data types for 
data arguments but it is **NOT RECOMMENDED** for functions with 3 or more 
arguments.
   5. String functions **MAY** choose to output in Utf8View instead of Utf8 if 
DataFusion is configured with `schema_force_view_types` == `true`. Otherwise 
string functions **SHOULD** output string results in the same type as the 
received primary data argument.
   6. String functions **SHOULD** rely on type coercion to handle non-string 
data. For example, `concat('ab', 2, 'cc')`. 
   7. String functions **MUST** handle non-control Unicode textual character 
classes unless the function is explicitly designed for a particular character 
set (ASCII, for example).
   8. String functions **SHOULD NOT** attempt to specially handle Unicode 
grapheme clusters unless doing so is directly related to the function's 
requirements.
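
As a rough illustration of points 3 and 5, here is a minimal sketch in plain Rust. It does not use the DataFusion or Arrow APIs: `StringType`, `coerce_data_args`, and `output_type` are hypothetical names invented for this example, and the ordering Utf8 < LargeUtf8 < Utf8View is only one assumed reading of "largest type". The `force_view_types` parameter stands in for the `schema_force_view_types` setting.

```rust
/// Hypothetical stand-in for the string types discussed in the proposal.
/// Variant order defines the assumed "size" ordering used by `max()`:
/// Utf8 < LargeUtf8 < Utf8View (an assumption, not taken from the proposal).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Point 3 (sketch): coerce all data arguments to the largest type
/// among them. Returns `None` when there are no data arguments.
fn coerce_data_args(args: &[StringType]) -> Option<StringType> {
    args.iter().copied().max()
}

/// Point 5 (sketch): by default the output type follows the primary
/// data argument. When view types are forced, a Utf8 result may be
/// produced as Utf8View instead; other types are left unchanged.
fn output_type(primary: StringType, force_view_types: bool) -> StringType {
    if force_view_types && primary == StringType::Utf8 {
        StringType::Utf8View
    } else {
        primary
    }
}
```

Under these assumptions, a call with arguments of type Utf8, Utf8View, and LargeUtf8 would have every data argument coerced to Utf8View, so the function body only needs one implementation per string type rather than one per combination of types.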
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   I am unsure about whether all string functions should be required to handle 
dictionary types or not. 

