jorgecarleitao commented on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-683318690
Thanks @alamb for writing this up. I think it is a good summary.
I generally agree with options 3 and 1, as both offer the same user experience and, IMO, differ only in our own internal implementation. What I like about 1 and 3 is that the user's intent is logically clear from `concat([c1, c2])`: concatenate the two columns (and I do not care whether they are large or not; just do your best). If a user wants a specific type, they can cast the columns explicitly.
With this said, as an exercise, let me try to write down how I imagine an interface could look for option 3, just to check whether I have the same understanding as you do.
First, the actual functions:
```
pub fn concat(args: &[StringArray]) -> Result<StringArray>;
pub fn concat_large(args: &[LargeStringArray]) -> Result<LargeStringArray>;
```
Both implementations would be essentially the same code, so we would probably write a macro and use it in each function, for DRY purposes (a possible sketch follows below).
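To make this concrete, here is a minimal sketch of such a macro. It is purely illustrative: `concat_impl!` is a made-up name, the row-wise semantics (result is null if any argument is null at that row) are my assumption, and it assumes the arrow string arrays can be built via `From<Vec<Option<&str>>>`.
```
use arrow::array::{Array, LargeStringArray, StringArray};
use arrow::error::Result;

// hypothetical macro: generates one statically typed kernel per string array type
macro_rules! concat_impl {
    ($FUNC:ident, $ARRAY:ty) => {
        pub fn $FUNC(args: &[$ARRAY]) -> Result<$ARRAY> {
            // row-wise concatenation: the result is null if any argument is null
            let len = args.iter().map(|a| a.len()).min().unwrap_or(0);
            let values: Vec<Option<String>> = (0..len)
                .map(|i| {
                    if args.iter().any(|a| a.is_null(i)) {
                        None
                    } else {
                        Some(args.iter().map(|a| a.value(i)).collect::<String>())
                    }
                })
                .collect();
            // build the output array with the same offset width as the inputs
            let refs: Vec<Option<&str>> = values.iter().map(|v| v.as_deref()).collect();
            Ok(<$ARRAY>::from(refs))
        }
    };
}

concat_impl!(concat, StringArray);
concat_impl!(concat_large, LargeStringArray);
```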
#### Physical evaluation
Next, we need to physically evaluate them. This is a bit less clear to me in
option 3. Is it the idea that we have one physical node per variation, whose
return type is not `ArrayRef`?
Or is the idea that the physical node would be the same for both?
The closest to DataFusion's current design is something like
```
impl PhysicalExpr for Concatenate {
    fn evaluate(&self, batch: &RecordBatch) -> Result<ArrayRef> {
        // evaluate the arguments; each result is a dynamically typed ArrayRef
        let inputs = self
            .args
            .iter()
            .map(|e| e.evaluate(batch))
            .collect::<Result<Vec<_>>>()?;
        // dispatch on the arguments' type and call the statically typed kernel
        match inputs[0].data_type() {
            DataType::Utf8 => {
                // let inputs = downcast every arg in `inputs` to StringArray,
                // then wrap the typed result back into an ArrayRef
                Ok(Arc::new(concat(&inputs)?))
            }
            DataType::LargeUtf8 => {
                // let inputs = downcast every arg in `inputs` to LargeStringArray
                Ok(Arc::new(concat_large(&inputs)?))
            }
            // in practice this arm would return an Err
            other => todo!("unsupported type: {:?}", other),
        }
    }
}
```
I will run with this, but I can see that this may not be what you mean by option 3 (could you please clarify?).
#### Physical planning
During physical planning, e.g. `Concatenate::new(exprs)`, we need to check that this won't cause execution errors at compute time due to type mismatches. This is because both the input (`RecordBatch`) and the return value (`ArrayRef`) are dynamically typed.
So, we need to specify which types the `RecordBatch` can contain (`Utf8` or `LargeUtf8` in this case). To check that the return type aligns with the next node's valid types, we also need to know the return type of `Concatenate` based on its argument types. So, something like
```
fn concatenate_valid_types() -> Vec<DataType>;
fn concatenate_return_type(arg_types: Vec<DataType>) -> Result<DataType>;
```
and these need to be aligned (code-wise) with `Concatenate::evaluate`.
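For concreteness, a hypothetical sketch of these two helpers, mirroring the `match` in `Concatenate::evaluate` above (the error type and exact signatures are just illustrative, not DataFusion's actual API):
```
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

// the argument types that Concatenate accepts
fn concatenate_valid_types() -> Vec<DataType> {
    vec![DataType::Utf8, DataType::LargeUtf8]
}

// the return type follows the offset width of the arguments
fn concatenate_return_type(arg_types: Vec<DataType>) -> Result<DataType> {
    match arg_types.first() {
        Some(DataType::Utf8) => Ok(DataType::Utf8),
        Some(DataType::LargeUtf8) => Ok(DataType::LargeUtf8),
        other => Err(ArrowError::ComputeError(format!(
            "concatenate does not support arguments of type {:?}",
            other
        ))),
    }
}
```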
#### Logical planning
Finally, logical planning. We could use `Expr::Concatenate`, or maybe `Expr::ScalarFunction` (for all built-ins). Since `Concatenate`'s return type differs based on its input, we have the same problem as before: we need to declare the return type of `Expr::Concatenate`, which we can only know from its input. So, again, we would need two functions like the ones above (or re-use them) to check that the argument types are valid and to compute the return type (a small sketch follows below).
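Something like this, purely illustrative and re-using the two helpers from the sketch above (how the planner resolves each argument expression's `DataType` against the schema is glossed over):
```
// infer the return type of a hypothetical Expr::Concatenate from the
// already-resolved types of its argument expressions
fn concatenate_logical_return_type(arg_types: Vec<DataType>) -> Result<DataType> {
    // validate the arguments against the declared valid types
    for t in &arg_types {
        if !concatenate_valid_types().contains(t) {
            return Err(ArrowError::ComputeError(format!(
                "concatenate does not support arguments of type {:?}",
                t
            )));
        }
    }
    // derive the output type from the input types, as in physical planning
    concatenate_return_type(arg_types)
}
```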
Again, I am just trying to run through the option and see how it would look. Is this what you are thinking, or is it something different?
Regardless of whether it is or not, the critical point for me in all of this is where we perform the `downcast`, because that is the moment we go from dynamic to static typing, and my understanding is that the main difference between 1 and 3 is where that downcast happens.
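For reference, the downcast in question is arrow's usual `as_any` pattern, i.e. something like:
```
use arrow::array::{Array, ArrayRef, StringArray};

// the moment we go from the dynamically typed ArrayRef
// to the statically typed StringArray
fn downcast_utf8(array: &ArrayRef) -> Option<&StringArray> {
    array.as_any().downcast_ref::<StringArray>()
}
```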