alamb commented on pull request #7967: URL: https://github.com/apache/arrow/pull/7967#issuecomment-683275858
I wonder if it would be helpful to think about types in logical and physical plans differently. Let's take @jorgecarleitao's `concat` as an example and test the output of concatenating two columns together. I think the ideal experience is to use an expression that appears in a query / data frame, like the following. ``` 'FooBar' = concat(A, B) ``` And let's consider what has to happen when running the `ExecutionPlan` when A and B are both `Utf8` columns or are both `LargeUtf8` columns. ### Execution Plan Option 1 Use a single function that can handle either Utf8 or LargeUtf8 inputs ( what I think this PR is proposing) Pros: 1. The same function information can be used both in logical and physical plans, and the semantics are clear Cons 1. The type signature (aka what input types will be supplied and what output types produced) needs to be computed both at plan time (e.g. type coercion logic so the output type is known) and at runtime (the actual UDF needs to switch on input type) and those computations need to remain consistent 2. The implementation of the operator might be more complicated (as all the combinations of supported types must be enumerated and the actual calculation needs to support all those combinations) ### Execution Plan Option 2 Use distinct functions for different type signatures (e.g. `concat` and `concat_large`). I *think* this is what Spark does (https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#register-a-function-as-a-udf) Pros; 1. Each implementation is simpler and unambiguous as to its type 2. The type coercion logic is also simpler as there is only a single type signature that could match Cons: 1. More verbose user experience as now users have to specify the exact function they want (e.g. `concat` or `concat_large`). I think this is what @jorgecarleitao was describing with "functions that return different types" ### Execution Plan Option 3 (proposal) Have a single expression that can take multiple type signatures, and provide multiple UDFs that each have a single type signature. So that would mean the expression: ``` 'FooBar' = concat(A, B) ``` If A and B were Utf8, the physical expression would be ``` 'FooBar' = concat_utf8(A, B) ``` And if A and B were LongUtf8, then the physical expression would be ``` 'FooBar' = concat_long_utf8(A, B) ``` And perhaps the type coercion logic would be responsible for picking which specific function to use Pros: 1. Allows for reasonable user experience (can use `concat`) 2. The types are all well known at execution plan type. Cons: 1. Would be a larger change to DataFusion 2. May not align with what other systems do. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org