tustvold commented on issue #1047:
URL: https://github.com/apache/arrow-rs/issues/1047#issuecomment-1572425176

   So I've been playing around with this, and the major challenge is avoiding a huge amount of API churn / boilerplate.
   
   Take the signature
   
   `add_dyn(a: &dyn Array, b: &dyn Array) -> Result<ArrayRef>`
   
   It's not clear how to convert this to a Datum-based model. One option would be
   
   ```
   add_dyn(a: Datum<'_>, b: Datum<'_>) -> Result<ArrayRef>
   ```
   
   Where `Datum` is something like
   
   ```
   enum Datum<'a> {
       Array(&'a dyn Array),
       Scalar(&'a dyn Scalar)
   }
   ```
   
   But this has a couple of issues
   
   * Call sites now have to explicitly wrap their arguments in `Datum`
   * There is no way to return a scalar
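To make these two issues concrete, here is a minimal sketch of the enum approach. The `Array` and `Scalar` traits, `Int64Array`, `Int64Scalar`, and the `Vec<i64>` return type are toy stand-ins invented for illustration, not the arrow-rs types.

```rust
// Toy stand-ins for illustration only; these are NOT the arrow-rs traits.
trait Array {
    fn values(&self) -> Vec<i64>;
}
trait Scalar {
    fn value(&self) -> i64;
}

struct Int64Array(Vec<i64>);
struct Int64Scalar(i64);

impl Array for Int64Array {
    fn values(&self) -> Vec<i64> {
        self.0.clone()
    }
}
impl Scalar for Int64Scalar {
    fn value(&self) -> i64 {
        self.0
    }
}

enum Datum<'a> {
    Array(&'a dyn Array),
    Scalar(&'a dyn Scalar),
}

// The return type is always an array (a Vec<i64> in this sketch),
// so even scalar + scalar comes back as a one-element array.
fn add_dyn(a: Datum<'_>, b: Datum<'_>) -> Result<Vec<i64>, String> {
    match (a, b) {
        (Datum::Array(a), Datum::Array(b)) => {
            let (a, b) = (a.values(), b.values());
            if a.len() != b.len() {
                return Err("length mismatch".to_string());
            }
            Ok(a.iter().zip(&b).map(|(x, y)| x + y).collect())
        }
        // Addition is commutative, so both mixed orderings share one arm.
        (Datum::Array(a), Datum::Scalar(s)) | (Datum::Scalar(s), Datum::Array(a)) => {
            let s = s.value();
            Ok(a.values().iter().map(|x| x + s).collect())
        }
        (Datum::Scalar(a), Datum::Scalar(b)) => Ok(vec![a.value() + b.value()]),
    }
}
```

Every call then has to spell out the wrapper, e.g. `add_dyn(Datum::Array(&arr), Datum::Scalar(&s))`, which is the churn described in the first bullet.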
   
   Making `Datum` a trait doesn't help here either, because without specialization the coherence rules prevent blanket implementations for both `T: Scalar` and `T: Array`.
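A small sketch of the conflict, again using hypothetical toy traits rather than the real arrow-rs ones. The second blanket impl is left commented out so the block compiles. As far as I know, uncommenting it is rejected with E0119 (conflicting implementations), since nothing stops a single type from implementing both `Array` and `Scalar`.

```rust
// Toy stand-ins for illustration only; these are NOT the arrow-rs traits.
trait Array {
    fn len(&self) -> usize;
}
trait Scalar {}

trait Datum {
    fn is_scalar(&self) -> bool;
}

// First blanket impl: every Array is a Datum.
impl<T: Array> Datum for T {
    fn is_scalar(&self) -> bool {
        false
    }
}

// Second blanket impl: rejected by the compiler if uncommented, because it
// overlaps with the impl above for any type that is both Array and Scalar.
// impl<T: Scalar> Datum for T {
//     fn is_scalar(&self) -> bool {
//         true
//     }
// }

struct Int64Array(Vec<i64>);

impl Array for Int64Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}
```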
   
   Another option would be to make the methods generic over `impl Into<Datum>`, but this also has downsides:
   
   * It runs into the same blanket impl issues as making `Datum` a trait
   * Generic kernels result in significant additional codegen
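A rough sketch of what the generic variant might look like, with toy types that are assumptions for illustration and not the arrow-rs API. Because a blanket `From` impl over `T: Array` and `T: Scalar` hits the same overlap problem, the sketch falls back to one `From` impl per concrete type. The outer generic function is kept as a thin shim over a non-generic inner function, which is one common way to limit the extra codegen from monomorphization.

```rust
// Toy stand-ins for illustration only; these are NOT the arrow-rs traits.
trait Array {
    fn values(&self) -> Vec<i64>;
}
trait Scalar {
    fn value(&self) -> i64;
}

struct Int64Array(Vec<i64>);
struct Int64Scalar(i64);

impl Array for Int64Array {
    fn values(&self) -> Vec<i64> {
        self.0.clone()
    }
}
impl Scalar for Int64Scalar {
    fn value(&self) -> i64 {
        self.0
    }
}

enum Datum<'a> {
    Array(&'a dyn Array),
    Scalar(&'a dyn Scalar),
}

// One impl per concrete type: a blanket impl over all `T: Array` and all
// `T: Scalar` would overlap, so each type has to be listed by hand.
impl<'a> From<&'a Int64Array> for Datum<'a> {
    fn from(a: &'a Int64Array) -> Self {
        Datum::Array(a)
    }
}
impl<'a> From<&'a Int64Scalar> for Datum<'a> {
    fn from(s: &'a Int64Scalar) -> Self {
        Datum::Scalar(s)
    }
}

// Generic shim: monomorphized once per pair of argument types.
fn add<'a>(a: impl Into<Datum<'a>>, b: impl Into<Datum<'a>>) -> Vec<i64> {
    add_impl(a.into(), b.into())
}

fn len_of(d: &Datum<'_>) -> usize {
    match d {
        Datum::Array(a) => a.values().len(),
        Datum::Scalar(_) => 1,
    }
}

// Materialize a datum as `n` values, repeating a scalar `n` times.
fn expand(d: &Datum<'_>, n: usize) -> Vec<i64> {
    match d {
        Datum::Array(a) => a.values(),
        Datum::Scalar(s) => vec![s.value(); n],
    }
}

// Non-generic body, compiled once. For brevity this sketch does not check
// that two array arguments have equal length; `zip` simply truncates.
fn add_impl(a: Datum<'_>, b: Datum<'_>) -> Vec<i64> {
    let n = len_of(&a).max(len_of(&b));
    expand(&a, n)
        .into_iter()
        .zip(expand(&b, n))
        .map(|(x, y)| x + y)
        .collect()
}
```

Call sites become `add(&arr, &scalar)` with no explicit wrapping, at the cost of one `From` impl per type and one instantiation of `add` per argument-type pair.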
   
   Taking a step back, I had a potentially controversial thought: **why not just treat a single-element array as a scalar array**?
   
   This would have some pretty compelling advantages:
   
   * No changes to type signatures necessary
   * Unary kernels like casting just work with no modification
   * Complete type coverage for no effort
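As a sketch of the idea, the kernel below keeps the `(&dyn Array, &dyn Array)` shape and broadcasts whichever side has length 1. The `Array` trait, `Int64Array`, and the `String` error type are toy stand-ins, and inferring "scalar" purely from `len() == 1` is an assumption of this sketch, not a settled design.

```rust
use std::sync::Arc;

// Toy stand-ins for illustration only; these are NOT the arrow-rs types.
trait Array {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> i64;
}
type ArrayRef = Arc<dyn Array>;

struct Int64Array(Vec<i64>);

impl Array for Int64Array {
    fn len(&self) -> usize {
        self.0.len()
    }
    fn value(&self, i: usize) -> i64 {
        self.0[i]
    }
}

// Same signature shape as before: a length-1 argument is treated as a scalar.
fn add_dyn(a: &dyn Array, b: &dyn Array) -> Result<ArrayRef, String> {
    let len = match (a.len(), b.len()) {
        (x, y) if x == y => x,
        (1, y) => y,
        (x, 1) => x,
        (x, y) => return Err(format!("length mismatch: {x} vs {y}")),
    };
    // A length-1 side always reads index 0, which broadcasts its value.
    let at = |arr: &dyn Array, i: usize| arr.value(if arr.len() == 1 { 0 } else { i });
    let out: Vec<i64> = (0..len).map(|i| at(a, i) + at(b, i)).collect();
    Ok(Arc::new(Int64Array(out)))
}
```

A call such as `add_dyn(&Int64Array(vec![1, 2, 3]), &Int64Array(vec![10]))` needs no wrapper type, and a unary kernel never has to know whether its input was meant as a scalar.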
   
   The obvious downside is that the representation is not very memory efficient. I think the question boils down to what the purpose of the scalar representation is. Is it to:
   
   1. Allow more efficient kernels where one side is known to be a scalar, e.g. scalar comparison, etc.
   2. Provide an efficient type-erased representation for row-oriented operations like grouping
   3. Provide efficient scalar operations
   
   My 2 cents is that 2. is a use case better served by the row representation, and 3. is beyond the scope of a vectorized execution engine, so 1. is the target for this feature. As such, I think this is a perfectly acceptable approach. The overhead of the slightly less efficient representation will be dwarfed by the cost of the dynamic dispatch alone.
   
   What do people think?
   

