[GitHub] [arrow] alamb commented on pull request #7967: ARROW-9751: [Rust] [DataFusion] Allow UDFs to accept multiple data types per argument

GitBox Sat, 29 Aug 2020 04:13:53 -0700


alamb commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-683275858



   I wonder if it would be helpful to think about types in logical and physical 
plans differently. 
   
   Let's take @jorgecarleitao's `concat` as an example and test the output of 
concatenating two columns together. I think the ideal experience is to use an 
expression that appears in a query / data frame, like the following. 
   
   ```
   'FooBar' = concat(A, B)
   ```
   
   And let's consider what has to happen when running the `ExecutionPlan` when 
A and B are both `Utf8` columns or are both `LargeUtf8` columns. 
   
   
   ### Execution Plan Option 1 
   Use a single function that can handle either Utf8 or LargeUtf8 inputs (what 
I think this PR is proposing)
   
   Pros:
   1. The same function information can be used both in logical and physical 
plans, and the semantics are clear
   
   Cons:
   1. The type signature (aka what input types will be supplied and what output 
types produced) needs to be computed both at plan time (e.g. type coercion 
logic so the output type is known) and at runtime (the actual UDF needs to 
switch on input type) and those computations need to remain consistent
   2. The implementation of the operator might be more complicated (as all the 
combinations of supported types must be enumerated and the actual calculation 
needs to support all those combinations)
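
To make the first con concrete, here is a minimal sketch of what Option 1 implies. It does not use the real Arrow or DataFusion APIs: `DataType`, `Column`, `concat_return_type` and `concat` below are hypothetical stand-ins, and a `Column` is just a tagged `Vec<String>`. The point is only that the set of supported types appears twice, once at plan time and once at run time, and the two `match` blocks have to be kept in sync by hand.

```rust
// Hypothetical stand-ins for illustration; NOT the Arrow / DataFusion types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Int32,
}

// A string column tagged with the (assumed) physical type it would have.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Utf8(Vec<String>),
    LargeUtf8(Vec<String>),
}

/// Plan time: derive the output type from the input types.
fn concat_return_type(args: &[DataType]) -> Result<DataType, String> {
    match args {
        [DataType::Utf8, DataType::Utf8] => Ok(DataType::Utf8),
        [DataType::LargeUtf8, DataType::LargeUtf8] => Ok(DataType::LargeUtf8),
        other => Err(format!("concat does not support {:?}", other)),
    }
}

/// Run time: the same function switches on the actual input types.
/// This match must agree with the one in `concat_return_type`.
fn concat(a: &Column, b: &Column) -> Result<Column, String> {
    // Element-wise string concatenation shared by both branches.
    fn join(a: &[String], b: &[String]) -> Vec<String> {
        a.iter().zip(b).map(|(x, y)| format!("{}{}", x, y)).collect()
    }
    match (a, b) {
        (Column::Utf8(a), Column::Utf8(b)) => Ok(Column::Utf8(join(a, b))),
        (Column::LargeUtf8(a), Column::LargeUtf8(b)) => Ok(Column::LargeUtf8(join(a, b))),
        _ => Err("concat: unsupported combination of input types".to_string()),
    }
}
```

Adding a third supported type in this sketch means touching both functions, which is where the consistency risk comes from.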
   
   ### Execution Plan Option 2
   Use distinct functions for different type signatures (e.g. `concat` and 
`concat_large`). I *think* this is what Spark does 
(https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#register-a-function-as-a-udf)
   
   Pros:
   1. Each implementation is simpler and unambiguous as to its type
   2. The type coercion logic is also simpler as there is only a single type 
signature that could match
   
   Cons:
   1. More verbose user experience as now users have to specify the exact 
function they want (e.g. `concat` or `concat_large`). I think this is what 
@jorgecarleitao  was describing with "functions that return different types"
   
   ### Execution Plan Option 3 (proposal)
   Have a single expression that can take multiple type signatures, and provide 
multiple UDFs that each have a single type signature. 
   
   So that would mean the expression:
   ```
   'FooBar' = concat(A, B)
   ```
   
   If A and B were Utf8, the physical expression would be 
   
   ```
   'FooBar' = concat_utf8(A, B)
   ```
   
   And if A and B were LargeUtf8, then the physical expression would be
   ```
   'FooBar' = concat_long_utf8(A, B)
   ```
   
   And perhaps the type coercion logic would be responsible for picking which 
specific function to use
   
   Pros:
   1. Allows for reasonable user experience (can use `concat`)
   2. The types are all well known at execution plan time. 
   
   Cons:
   1. Would be a larger change to DataFusion
   2. May not align with what other systems do. 
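
A rough sketch of how Option 3 could look, again with hypothetical stand-ins rather than the real DataFusion types: `PhysicalFn`, `CONCAT_IMPLS` and `resolve` are names invented for this example, and the physical function names are the ones used above. The logical name `concat` maps to a list of candidates, each with exactly one signature, and the planner (or the type coercion pass) picks one by exact match on the argument types.

```rust
// Hypothetical stand-ins for illustration; NOT the Arrow / DataFusion types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Int32,
}

/// A physical function: one name, exactly one type signature.
#[derive(Debug, PartialEq)]
struct PhysicalFn {
    name: &'static str,
    arg_types: &'static [DataType],
    return_type: DataType,
}

/// Candidates behind the logical name `concat` (assumed registry contents).
const CONCAT_IMPLS: &[PhysicalFn] = &[
    PhysicalFn {
        name: "concat_utf8",
        arg_types: &[DataType::Utf8, DataType::Utf8],
        return_type: DataType::Utf8,
    },
    PhysicalFn {
        name: "concat_long_utf8",
        arg_types: &[DataType::LargeUtf8, DataType::LargeUtf8],
        return_type: DataType::LargeUtf8,
    },
];

/// Plan time: choose the candidate whose signature matches the argument
/// types exactly. Returns None when nothing matches; a real planner might
/// try to insert casts at that point instead of giving up.
fn resolve<'a>(candidates: &'a [PhysicalFn], args: &[DataType]) -> Option<&'a PhysicalFn> {
    candidates.iter().find(|f| f.arg_types == args)
}
```

With this shape, calling `resolve(CONCAT_IMPLS, &[DataType::Utf8, DataType::Utf8])` selects `concat_utf8`, so the output type is fixed once planning is done and the function body never needs to switch on input types.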


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
