jorisvandenbossche commented on PR #12590: URL: https://github.com/apache/arrow/pull/12590#issuecomment-1089895876
> There are times the execution engine will know that a column had the exact same value for every row.

> As an optimization, the execution engine operates on ExecBatch instead of RecordBatch. Each column in an exec batch can have one of two shapes. A scalar column has a single value for the entire batch. An array column has an actual array (as you would expect in a record batch).

> ...

> So then, if the expression were "SELECT MyUDF(A, C)" we would look for a "scalar function" with a <scalar, scalar> kernel. If the expression were "SELECT MyUDF(A, B)" we would look for a "scalar function" with a <scalar, array> kernel. If the expression were "SELECT MyUDF(B, B)" we would look for a "scalar function" with a <array, array> kernel.

Thanks, that's a very clear explanation of why a scalar version of a "scalar kernel" is useful in addition to an array version.

Now, personally, putting on my hat as a naive Python user, I would expect that:

1) By default, this is handled automatically if I only register an array version of the scalar kernel (as David points out with the internal helper, a scalar can be passed as a length-1 array to the array version of the scalar kernel, although this trick will typically only work for unary kernels).
2) As an optimization, I can optionally also register a scalar version of my kernel in addition to the array version (or a mixed array/scalar or scalar/array version in the case of a binary kernel, etc.). But in that case I should be able to register it under the same name.

(And to be clear, I don't expect those aspects to be addressed in this first PR; I just want to get a good understanding of the issue. As David mentions, it might also make sense to start with only the array version in this PR.)

One more thing: the register function is currently _only_ for "scalar kernels", but I suppose we want to extend this in the future to aggregation and vector kernels? Should that then already be reflected in the interface somehow?
(either a keyword where you specify "scalar" for the kernel type, or rename the function to be clear it is only about scalar kernels)
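
To make the two expectations above concrete, here is a minimal pure-Python sketch of the dispatch behaviour being asked for. It is not the Arrow or pyarrow API: `KernelRegistry`, `register`, `call`, and the `SCALAR`/`ARRAY` tags are all hypothetical names, and plain Python lists stand in for Arrow arrays. One assumption to flag: in the fallback path the sketch repeats a scalar to the batch length so that mixed scalar/array calls also work; the length-1 trick mentioned in the comment corresponds only to the case where every argument is a scalar.

```python
SCALAR = "scalar"
ARRAY = "array"


class KernelRegistry:
    """Hypothetical registry: one function name, several kernels keyed by argument shapes."""

    def __init__(self):
        self._kernels = {}

    def register(self, name, shapes, func):
        # Several shape signatures may be registered under the same name.
        self._kernels[(name, tuple(shapes))] = func

    def call(self, name, *args):
        # Lists stand in for array columns; anything else is a scalar column.
        shapes = tuple(ARRAY if isinstance(a, list) else SCALAR for a in args)

        # 1) Prefer a kernel registered for exactly these shapes.
        kernel = self._kernels.get((name, shapes))
        if kernel is not None:
            return kernel(*args)

        # 2) Otherwise fall back to the all-array kernel.
        kernel = self._kernels.get((name, (ARRAY,) * len(args)))
        if kernel is None:
            raise KeyError(f"no kernel for {name}{shapes}")

        # Repeat each scalar to the batch length (length 1 if all inputs are scalars).
        length = max((len(a) for a in args if isinstance(a, list)), default=1)
        widened = [a if isinstance(a, list) else [a] * length for a in args]
        result = kernel(*widened)

        # All-scalar input: unwrap the length-1 result back into a scalar.
        return result if ARRAY in shapes else result[0]


registry = KernelRegistry()

# Only the array version is registered; scalar inputs are handled automatically.
registry.register("add", (ARRAY, ARRAY), lambda xs, ys: [x + y for x, y in zip(xs, ys)])
print(registry.call("add", [1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(registry.call("add", 5, [1, 2, 3]))             # [6, 7, 8]
print(registry.call("add", 2, 3))                     # 5

# Optional optimization: a dedicated <scalar, scalar> kernel under the same name.
registry.register("add", (SCALAR, SCALAR), lambda x, y: x + y)
print(registry.call("add", 2, 3))                     # 5, now without the array round trip
```

Keying the registry on the pair (name, shapes) is what lets a specialised kernel be added later without changing the function name that callers use.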
