jorgecarleitao commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-683318690
Thanks @alamb for writing this up. I think it is a good summary.
I generally agree with options 3 and 1, as both offer the same user experience and, IMO, differ only in our own internal implementation. What I like about 1 and 3 is that the user's intent is logically clear from `concat([c1, c2])`: concatenate the two columns (and I do not care whether they are large or not; just do your best). If a user wants a specific type, they can cast the columns explicitly.
With this said, as an exercise, let me try to write down how I imagine an interface could look for option 3, just to check whether I have the same understanding as you do.
First, the actual functions:
```
pub fn concat(args: &[StringArray]) -> Result<StringArray>;
pub fn concat_large(args: &[LargeStringArray]) -> Result<LargeStringArray>;
```
Both implementations would be essentially the same code, so we would probably write a macro and use it in each function, for DRY purposes (a possible sketch follows below).
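To make this concrete, here is a minimal sketch of such a macro. It is purely illustrative: `concat_impl!` is a made-up name, the row-wise semantics (result is null if any argument is null at that row) are my assumption, and it assumes the arrow string arrays can be built via `From<Vec<Option<&str>>>`.
```
use arrow::array::{Array, LargeStringArray, StringArray};
use arrow::error::Result;

// hypothetical macro: generates one statically typed kernel per string array type
macro_rules! concat_impl {
    ($FUNC:ident, $ARRAY:ty) => {
        pub fn $FUNC(args: &[$ARRAY]) -> Result<$ARRAY> {
            // row-wise concatenation: the result is null if any argument is null
            let len = args.iter().map(|a| a.len()).min().unwrap_or(0);
            let values: Vec<Option<String>> = (0..len)
                .map(|i| {
                    if args.iter().any(|a| a.is_null(i)) {
                        None
                    } else {
                        Some(args.iter().map(|a| a.value(i)).collect::<String>())
                    }
                })
                .collect();
            // build the output array with the same offset width as the inputs
            let refs: Vec<Option<&str>> = values.iter().map(|v| v.as_deref()).collect();
            Ok(<$ARRAY>::from(refs))
        }
    };
}

concat_impl!(concat, StringArray);
concat_impl!(concat_large, LargeStringArray);
```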
#### Physical evaluation
Next, we need to physically evaluate them. This is a bit less clear to me in
option 3. Is it the idea that we have one physical node per variation, whose
return type is not `ArrayRef`?
Or is the idea that the physical node would be the same for both?
The closest to DataFusion's current design is something like
```
impl PhysicalExpr for Concatenate {
    fn evaluate(&self, batch: &RecordBatch) -> Result<ArrayRef> {
        // evaluate the arguments; each result is a dynamically typed ArrayRef
        let inputs = self
            .args
            .iter()
            .map(|e| e.evaluate(batch))
            .collect::<Result<Vec<_>>>()?;
        // dispatch on the arguments' type and call the statically typed kernel
        match inputs[0].data_type() {
            DataType::Utf8 => {
                // let inputs = downcast every arg in `inputs` to StringArray,
                // then wrap the typed result back into an ArrayRef
                Ok(Arc::new(concat(&inputs)?))
            }
            DataType::LargeUtf8 => {
                // let inputs = downcast every arg in `inputs` to LargeStringArray
                Ok(Arc::new(concat_large(&inputs)?))
            }
            // in practice this arm would return an Err
            other => todo!("unsupported type: {:?}", other),
        }
    }
}
```
I will run with this, but I can see that this may not be what you mean by option 3 (could you please clarify?).
#### Physical planning
During physical planning, e.g. `Concatenate::new(exprs)`, we need to check that this won't cause execution errors at compute time due to type mismatches. This is because both the input (`RecordBatch`) and the return value (`ArrayRef`) are dynamically typed.
So, we need to specify which types the `RecordBatch` can contain (`Utf8` or `LargeUtf8` in this case). To check that the return type aligns with the next node's valid types, we also need to know the return type of `Concatenate` based on its argument types. So, something like
```
fn concatenate_valid_types() -> Vec<DataType>;
fn concatenate_return_type(arg_types: Vec<DataType>) -> Result<DataType>;
```
and these need to be aligned (code-wise) with `Concatenate::evaluate`.
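For concreteness, a hypothetical sketch of these two helpers, mirroring the `match` in `Concatenate::evaluate` above (the error type and exact signatures are just illustrative, not DataFusion's actual API):
```
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

// the argument types that Concatenate accepts
fn concatenate_valid_types() -> Vec<DataType> {
    vec![DataType::Utf8, DataType::LargeUtf8]
}

// the return type follows the offset width of the arguments
fn concatenate_return_type(arg_types: Vec<DataType>) -> Result<DataType> {
    match arg_types.first() {
        Some(DataType::Utf8) => Ok(DataType::Utf8),
        Some(DataType::LargeUtf8) => Ok(DataType::LargeUtf8),
        other => Err(ArrowError::ComputeError(format!(
            "concatenate does not support arguments of type {:?}",
            other
        ))),
    }
}
```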
#### Logical planning
Finally, logical planning. We could use `Expr::Concatenate`, or maybe `Expr::ScalarFunction` (for all built-ins). Since `Concatenate`'s return type differs based on its input, we have the same problem as before: we need to declare the return type of `Expr::Concatenate`, which we can only know from its input. So, again, we would need two functions like the ones above (or re-use them) to check that the argument types are valid and to compute the return type (a small sketch follows below).
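Something like this, purely illustrative and re-using the two helpers from the sketch above (how the planner resolves each argument expression's `DataType` against the schema is glossed over):
```
// infer the return type of a hypothetical Expr::Concatenate from the
// already-resolved types of its argument expressions
fn concatenate_logical_return_type(arg_types: Vec<DataType>) -> Result<DataType> {
    // validate the arguments against the declared valid types
    for t in &arg_types {
        if !concatenate_valid_types().contains(t) {
            return Err(ArrowError::ComputeError(format!(
                "concatenate does not support arguments of type {:?}",
                t
            )));
        }
    }
    // derive the output type from the input types, as in physical planning
    concatenate_return_type(arg_types)
}
```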
Again, I am just trying to run through the option and see how it would look. Is this what you are thinking, or is it something different?
Regardless of whether it is or not, the critical point for me in all of this is where we perform the `downcast`, because that is the moment we go from dynamic to static typing, and my understanding is that the main difference between 1 and 3 is where that downcast happens.
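For reference, the downcast in question is arrow's usual `as_any` pattern, i.e. something like:
```
use arrow::array::{Array, ArrayRef, StringArray};

// the moment we go from the dynamically typed ArrayRef
// to the statically typed StringArray
fn downcast_utf8(array: &ArrayRef) -> Option<&StringArray> {
    array.as_any().downcast_ref::<StringArray>()
}
```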