[GitHub] [spark] HyukjinKwon commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

via GitHub Sun, 20 Aug 2023 23:13:31 -0700


HyukjinKwon commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1685712229


   `mapInArrow` is marked as a developer API, and my initial intention was to 
avoid adding Arrow versions of it everywhere - in theory `mapInArrow` can 
cover all the cases except cogrouping (I admit I missed this case when we added 
`mapInArrow`). That's why I am hesitant to add this now.
   
   What about pandas UDFs and their iterator friends? I think it's too much to 
add Arrow versions of all of them to address a couple of corner cases and 
performance:
   
   - The performance benefit is not even that critical, since we already copy 
Arrow data here and there. In fact, vanilla Arrow itself copies by default when 
it's converted to something else.
   - `repartition` followed by `mapInArrow` will cover the grouping case. The 
only leftover case is cogrouping, which has a bit of overhead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


