jorisvandenbossche commented on PR #12590:
URL: https://github.com/apache/arrow/pull/12590#issuecomment-1089895876

   > There are times the execution engine will know that a column had the exact 
same value for every row.  
   > As an optimization, the execution engine operates on ExecBatch instead of 
RecordBatch. Each column in an exec batch can have one of two shapes. A scalar 
column has a single value for the entire batch. An array column has an actual 
array (as you would expect in a record batch).
   > ...
   > So then, if the expression were "SELECT MyUDF(A, C)" we would look for a 
"scalar function" with a <scalar, scalar> kernel. If the expression were 
"SELECT MyUDF(A, B)" we would look for a "scalar function" with a <scalar, 
array> kernel. If the expression were "SELECT MyUDF(B, B)" we would look for a 
"scalar function" with a <array, array> kernel.
   
   Thanks, that's a very clear explanation of why a scalar version of a "scalar 
kernel" is useful in addition to an array version. Now, personally, putting on 
my hat as a naive Python user, I would expect that:
   
   1) By default, this is handled automatically if I only register an array 
version of the scalar kernel (as David notes in pointing to the internal 
helper, a scalar can be passed as a length-1 array to the array version of the 
scalar kernel, although this trick will typically only work for unary kernels).
   2) As an optimization, I can optionally also register a scalar version of my 
kernel in addition to the array version (or a mixed array/scalar or 
scalar/array version in the case of a binary kernel, etc.). In that case, I 
should be able to register it under the same name.
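
To make these two expectations concrete, here is a minimal pure-Python sketch. It is not the pyarrow API: the names `register`, `call`, `shape_of` and the `_KERNELS` dictionary are hypothetical, plain Python lists stand in for arrays, and broadcasting a scalar to the batch length in the mixed case is my own assumption about how a fallback could handle non-unary kernels.

```python
# Hypothetical registry, NOT the pyarrow API. Kernels are keyed by
# (function name, tuple of argument shapes), so several shape-specific
# kernels can share one function name.
_KERNELS = {}


def shape_of(value):
    """Return "array" for Python lists and "scalar" for anything else."""
    return "array" if isinstance(value, list) else "scalar"


def register(name, shapes, kernel):
    """Register a kernel for one shape signature under the given name."""
    _KERNELS[(name, tuple(shapes))] = kernel


def call(name, *args):
    """Dispatch on argument shapes, falling back to the all-array kernel."""
    shapes = tuple(shape_of(a) for a in args)
    kernel = _KERNELS.get((name, shapes))
    if kernel is not None:
        # Point 2: an exact (optimized) kernel was registered for these shapes.
        return kernel(*args)

    # Point 1: only the array version exists, so adapt the inputs to it.
    fallback = _KERNELS.get((name, ("array",) * len(args)))
    if fallback is None:
        raise KeyError(f"no kernel registered for {name}{shapes}")
    if all(s == "scalar" for s in shapes):
        # All scalars: wrap each one as a length-1 array, then unwrap.
        return fallback(*[[a] for a in args])[0]
    # Mixed shapes (assumption): repeat scalars to the batch length.
    length = next(len(a) for a in args if isinstance(a, list))
    expanded = [a if isinstance(a, list) else [a] * length for a in args]
    return fallback(*expanded)


def add_arrays(xs, ys):
    """Array version of a binary "scalar kernel": element-wise addition."""
    return [x + y for x, y in zip(xs, ys)]


# Registering only the array version is enough for every shape combination.
register("my_udf", ["array", "array"], add_arrays)
print(call("my_udf", [1, 2], [10, 20]))  # <array, array>: direct dispatch
print(call("my_udf", 1, 2))              # <scalar, scalar>: length-1 fallback
print(call("my_udf", 5, [1, 2]))         # <scalar, array>: broadcast fallback

# Optional optimization: a dedicated scalar kernel under the same name.
register("my_udf", ["scalar", "scalar"], lambda x, y: x + y)
```

The point of the sketch is that the user-facing name stays the same, and the shape-specific kernels are purely an optimization layered on top of the array version.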
   
   (To be clear, I don't expect those aspects to be addressed in this first 
PR; I just want to get a good understanding of the issue. As David mentions, 
it might also make sense to start with only the array version in this PR.)
   
   One more thing: the register function is currently _only_ for "scalar 
kernels", but I suppose we want to extend this in the future to aggregation and 
vector kernels? If so, should that already be reflected in the interface 
somehow (either a keyword where you specify "scalar" for the kernel type, or a 
function name that makes clear it is only about scalar kernels)?

