jorgecarleitao edited a comment on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-683318690
Thanks @alamb for writing this up. I think that it is a good summary. I generally agree with options 1 and 3, as both offer the same user experience and, IMO, differ only in our own internal implementation. What I like about 1 and 3 is that the user's intent is logically clear from `concat([c1, c2])`: concatenate the two columns (and I do not care whether they are large or not; just do your best). If a user wants a specific type, they can cast the columns explicitly.

With this said, as an exercise, let me try to write down how I imagine an interface for option 3 could look, just to check that I have the same understanding as you. First, the actual functions:

```rust
pub fn concat(args: &[StringArray]) -> Result<StringArray>;
pub fn concat_large(args: &[LargeStringArray]) -> Result<LargeStringArray>;
```

Both implementations would be similar code-wise, so we would probably write a macro and use it in each function, for DRY purposes.

#### Physical evaluation

Next, we need to physically evaluate them. This is a bit less clear to me in option 3. Is the idea that we have one physical node per variation, whose return type is not `ArrayRef`? Or is the idea that the physical node would be the same for both? The closest to DataFusion's current design is something like

```rust
impl PhysicalExpr for Concatenate {
    fn evaluate(&self, batch: &RecordBatch) -> Result<ArrayRef> {
        // evaluate the arguments
        let inputs = self
            .args
            .iter()
            .map(|e| e.evaluate(batch))
            .collect::<Result<Vec<_>>>()?;

        // evaluate the function by dispatching on the arguments' type
        match inputs[0].data_type() {
            DataType::Utf8 => {
                // let inputs = downcast every arg in inputs to StringArray
                Ok(Arc::new(concat(&inputs)?))
            }
            DataType::LargeUtf8 => {
                // let inputs = downcast every arg in inputs to LargeStringArray
                Ok(Arc::new(concat_large(&inputs)?))
            }
            // physical planning (below) is responsible for ensuring we never
            // reach this arm with an unsupported type
            other => unreachable!("Concatenate does not support {:?}", other),
        }
    }
}
```

I will run with this, but I can see that this may not be what you mean by option 3 (could you please clarify?).

#### Physical planning

During physical planning, e.g. `Concatenate::new(exprs)`, we need to check that this won't cause execution errors at compute time due to type mismatches. This is because both the input (`RecordBatch`) and the return value (`ArrayRef`) are dynamically typed. So, we need to specify which types the `RecordBatch` can contain (`Utf8` or `LargeUtf8` in this case). To check that the return type aligns with the next node's valid types, we also need to know the return type of `Concatenate` based on its argument types. So, something like

```rust
fn concatenate_valid_types() -> Vec<DataType>;
fn concatenate_return_type(arg_types: Vec<DataType>) -> Result<DataType>;
```

and these need to be kept aligned (code-wise) with `Concatenate::evaluate`. (I sketch one possible implementation of these two at the end of this comment, just as an illustration.)

#### Logical planning

Finally, logical planning. We could use `Expr::Concatenate`, or maybe `Expr::ScalarFunction` (for all built-ins). Since `Concatenate`'s return type differs based on its input, we have the same problem as before: we need to declare the return type of `Expr::Concatenate`, which we can only know from its input. So, again, we would need two functions like the ones above (or re-use them) to check that the input types are valid and to derive the return type.

Again, I am trying to run through the option and see how it would look. Is this what you are thinking, or something different? Either way, the critical question for me in all of this is where we perform the `downcast`, because that is the moment we go from dynamic to static typing, and my understanding is that the main difference between options 1 and 3 is where that downcast happens.
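PS: to make the physical planning step a bit more concrete, here is a minimal sketch of how the two planning helpers could be filled in for this example. This is just my own illustration using arrow's `DataType` and `ArrowError`, not a claim about where this code should live or what it should be named:

```rust
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

/// The argument types `Concatenate` accepts (hypothetical helper).
fn concatenate_valid_types() -> Vec<DataType> {
    vec![DataType::Utf8, DataType::LargeUtf8]
}

/// Derive the return type from the argument types, failing at planning time
/// (rather than at execution time) on a type mismatch (hypothetical helper).
fn concatenate_return_type(arg_types: Vec<DataType>) -> Result<DataType> {
    match arg_types.first() {
        // all arguments must share one of the valid types; the return type is that same type
        Some(first)
            if concatenate_valid_types().contains(first)
                && arg_types.iter().all(|t| t == first) =>
        {
            Ok(first.clone())
        }
        _ => Err(ArrowError::ComputeError(format!(
            "Concatenate does not support argument types {:?}",
            arg_types
        ))),
    }
}
```

The point is only that the validity check and the return-type derivation happen at planning time, from `DataType`s alone, before any array is downcast.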