dtenedor opened a new pull request, #42420:
URL: https://github.com/apache/spark/pull/42420

   ### What changes were proposed in this pull request?
   
   This PR implements query execution support for the PARTITION BY and 
ORDER BY clauses for UDTF TABLE arguments.
   
   * The query planning support was added in 
https://github.com/apache/spark/pull/42100, 
https://github.com/apache/spark/pull/42174, and 
https://github.com/apache/spark/pull/42351. After those changes, the planner 
added a projection to compute the PARTITION BY expressions, followed by a 
repartition operator and a sort operator.
   * In this PR, the Python executor receives the indexes of these expressions 
within the input table's rows, and compares the values of the projected 
partitioning expressions between consecutive rows.
   * A change in these values marks the boundary between partitions, so at 
that point we call the UDTF instance's `terminate` method, then destroy the 
instance and create a new one for the next partition.
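
The boundary detection described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code added by this PR: `run_partitioned`, `CountRows`, and the tuple-based row representation are hypothetical names chosen for the example, and the input is assumed to arrive already grouped by the partitioning values (as the repartition and sort operators arrange).

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Sequence, Tuple

# Assumption for this sketch: a row is a plain tuple of column values.
Row = Tuple[Any, ...]


class CountRows:
    """Hypothetical UDTF-like class: counts rows, emits the total on terminate."""

    def __init__(self) -> None:
        self.count = 0

    def eval(self, row: Row) -> Iterator[Row]:
        self.count += 1
        return iter(())  # emits nothing per input row

    def terminate(self) -> Iterator[Row]:
        yield (self.count,)


def run_partitioned(
    rows: Iterable[Row],
    partition_indexes: Sequence[int],
    make_udtf: Callable[[], Any],
) -> Iterator[Row]:
    """Feed rows to a UDTF instance, starting a fresh one per partition.

    `partition_indexes` gives the positions of the projected partitioning
    expressions within each row.
    """
    udtf: Optional[Any] = None
    prev_key: Optional[Tuple[Any, ...]] = None
    for row in rows:
        key = tuple(row[i] for i in partition_indexes)
        if udtf is not None and key != prev_key:
            # The partitioning values changed: finish the current partition
            # and discard the instance.
            yield from udtf.terminate()
            udtf = None
        if udtf is None:
            udtf = make_udtf()
        prev_key = key
        yield from udtf.eval(row)
    if udtf is not None:
        # Flush the final partition.
        yield from udtf.terminate()


rows = [("a", 1), ("a", 2), ("b", 3)]
result = list(run_partitioned(rows, [0], CountRows))
```

With the sample rows, `result` holds one count per partition: two rows for `"a"` and one for `"b"`. Comparing only consecutive rows keeps the executor streaming; it never needs to buffer a whole partition to find its end.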
   
   ### Why are the changes needed?
   
   This brings full end-to-end execution for the PARTITION BY and/or ORDER BY 
clauses for UDTF TABLE arguments.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Queries that pass UDTF TABLE arguments with PARTITION BY and/or 
ORDER BY clauses now execute end to end.
   
   ### How was this patch tested?
   
   This PR adds end-to-end testing in `test_udtf.py`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

