westonpace commented on PR #12590: URL: https://github.com/apache/arrow/pull/12590#issuecomment-1089426784
> I think that I am familiar with our usage of the term "scalar function" in our compute kernels, but AFAIK that's not really how it translates here.

@jorisvandenbossche Ah, I think I see your point now. We are probably going to need even more exposition, though maybe we should save it for a PR adding to our user guide instead of trying to put everything into the function docs.

There are times when the execution engine knows that a column has the exact same value for every row. One example is the `__filename` column. Also, when reading a partitioned data source (e.g. /dataset/foo=7) we know that "foo" has the value 7 for every batch generated by that directory.

As an optimization, the execution engine operates on ExecBatch instead of RecordBatch. Each column in an exec batch can have one of two shapes. A scalar column has a single value for the entire batch. An array column has an actual array (as you would expect in a record batch).

RecordBatch

```
A | B | C
1 | 2 | 5
1 | 3 | 5
1 | 4 | 5
```

ExecBatch (length=3)

```
A : Scalar(1)
B : Array([2, 3, 4])
C : Scalar(5)
```

So then, if the expression were "SELECT MyUDF(A, C)" we would look for a "scalar function" with a <scalar, scalar> kernel.
If the expression were "SELECT MyUDF(A, B)" we would look for a "scalar function" with a <scalar, array> kernel.
If the expression were "SELECT MyUDF(B, B)" we would look for a "scalar function" with an <array, array> kernel.

---

Now, that being said, I wonder how much of this particular optimization we want to expose to Python users, for three reasons:

1. This is a lot of exposition to have to give to someone who just wants to write a quick UDF.
2. We don't yet generate scalar columns in very many cases.
3. I wonder if this particular optimization is going to be around long term. We recently discussed new in-memory encodings and one of them was an RLE encoding. If we add support for that encoding then a "scalar column" is really just an "array column" encoded with RLE.
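
To make the shape-based lookup above concrete, here is a minimal pure-Python sketch. It is not the Arrow API: `ScalarColumn`, `ArrayColumn`, `MY_UDF_KERNELS`, and `call_my_udf` are hypothetical names, and addition merely stands in for whatever `MyUDF` would compute. It only models the idea that the kernel is chosen by the shapes of the argument columns, and it deliberately does not model any fallback when a matching kernel is missing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass(frozen=True)
class ScalarColumn:
    value: int  # one value shared by every row in the batch


@dataclass(frozen=True)
class ArrayColumn:
    values: Tuple[int, ...]  # one value per row, as in a record batch


Column = Union[ScalarColumn, ArrayColumn]


def shape_of(column: Column) -> str:
    return "scalar" if isinstance(column, ScalarColumn) else "array"


def to_rows(batch: Dict[str, Column], length: int) -> List[Tuple[int, ...]]:
    """Expand every scalar column to recover the equivalent record batch rows."""
    expanded = [
        (col.value,) * length if isinstance(col, ScalarColumn) else col.values
        for col in batch.values()
    ]
    return list(zip(*expanded))


# Hypothetical kernels for MyUDF; addition is only a placeholder computation.
def add_scalar_scalar(a: ScalarColumn, b: ScalarColumn) -> Column:
    return ScalarColumn(a.value + b.value)


def add_scalar_array(a: ScalarColumn, b: ArrayColumn) -> Column:
    return ArrayColumn(tuple(a.value + v for v in b.values))


def add_array_array(a: ArrayColumn, b: ArrayColumn) -> Column:
    return ArrayColumn(tuple(x + y for x, y in zip(a.values, b.values)))


MY_UDF_KERNELS: Dict[Tuple[str, str], Callable[..., Column]] = {
    ("scalar", "scalar"): add_scalar_scalar,
    ("scalar", "array"): add_scalar_array,
    ("array", "array"): add_array_array,
}


def call_my_udf(batch: Dict[str, Column], left: str, right: str):
    """Pick a kernel from the argument shapes, then run it."""
    args = (batch[left], batch[right])
    signature = (shape_of(args[0]), shape_of(args[1]))
    kernel = MY_UDF_KERNELS[signature]  # KeyError if no kernel matches
    return signature, kernel(*args)


# The ExecBatch (length=3) from the example above.
exec_batch: Dict[str, Column] = {
    "A": ScalarColumn(1),
    "B": ArrayColumn((2, 3, 4)),
    "C": ScalarColumn(5),
}

if __name__ == "__main__":
    print(to_rows(exec_batch, 3))
    for left, right in [("A", "C"), ("A", "B"), ("B", "B")]:
        print(left, right, call_my_udf(exec_batch, left, right))
```

Note that a <scalar, scalar> call yields a scalar result in this sketch, so the single value never has to be materialized per row. That saving is the point of the optimization, and it is also why every UDF author would need to think about shapes if they were exposed.
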
CC @lidavidm Does the dispatcher automatically expand scalar columns if there is no scalar kernel but there is an array kernel? Do you think we should expose scalar/array shape in this PR?
