[GitHub] [arrow] westonpace commented on pull request #12590: ARROW-15639 [C++][Python] UDF Scalar Function Implementation

GitBox Tue, 05 Apr 2022 15:10:27 -0700


westonpace commented on PR #12590:
URL: https://github.com/apache/arrow/pull/12590#issuecomment-1089426784


   > I think that I am familiar with out usage of the term "scalar function" in 
our compute kernels, but AFAIK that's not really how it translates here.
   
   @jorisvandenbossche 
   
   Ah, I think I see your point now.  We are probably going to need even more 
exposition, though maybe we should save it for a PR adding to our user guide 
instead of trying to put everything into the function docs.
   
   There are times the execution engine will know that a column had the exact 
same value for every row.  For example, the `__filename` column.  Also, when 
reading a partitioned data source (e.g. /dataset/foo=7) we know that "foo" has 
the value 7 for every batch generated by that directory.
   
   As an optimization, the execution engine operates on ExecBatch instead of 
RecordBatch.  Each column in an exec batch can have one of two shapes.  A 
scalar column has a single value for the entire batch.  An array column has an 
actual array (as you would expect in a record batch).
   
   RecordBatch
   ```
   A | B | C
   1 | 2 | 5
   1 | 3 | 5
   1 | 4 | 5
   ```
   
   ExecBatch (length=3)
   ```
   A : Scalar(1)
   B : Array([2, 3, 4])
   C : Scalar(5)
   ```
   
   So then, if the expression were "SELECT MyUDF(A, C)" we would look for a 
"scalar function" with a <scalar, scalar> kernel.
   If the expression were "SELECT MyUDF(A, B)" we would look for a "scalar 
function" with a <scalar, array> kernel.  If the expression were "SELECT 
MyUDF(B, B)" we would look for a "scalar function" with a <array, array> kernel.
   
   ---
   
   Now, that being said, I wonder how much of this particular optimization we 
want to expose to python users for two reasons:
   
   1. This is a lot of exposition to have to give to someone that just wants to 
write a quick UDF.
   2. We don't yet generate scalar columns in too many cases.
   3. I wonder if this particular optimization is going to be around long term. 
 We recently discussed new in-memory encodings and one of them was an RLE 
encoding.  If we add support for that encoding then a "scalar column" is really 
just an "array column" encoded with RLE.
   
   CC @lidavidm Does the dispatcher automatically expand scalar columns if 
there is no scalar kernel but there is an array kernel?  Do you think we should 
expose scalar/array shape in this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on pull request #12590: ARROW-15639 [C++][Python] UDF Scalar Function Implementation

Reply via email to