[GitHub] [spark] ion-elgreco commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

via GitHub Mon, 21 Aug 2023 00:24:20 -0700


ion-elgreco commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1685790842


   > `mapInArrow` is marked as a developer API, and my initial intention was to 
avoid adding the arrow version of that everywhere - in theory `mapInArrow` can 
cover all the cases except cogrouping (I admit I missed this case when we added 
`mapInArrow`). That's why I am hesitant to add this now.
   > 
   > What about pandas UDF and iteration friends? I think it's too much to add 
all Arrow versions to address a couple of corner cases and performance:
   > 
   >  - The performance benefit is not even super critical since we 
already copy Arrow here and there. In fact, vanilla Arrow itself copies when it's 
converted to something else by default.
   > - repartition-mapInArrow will cover grouping case. The only leftover case 
is cogrouping that has a bit of overhead.
   
   What's the design philosophy behind having to go through pandas first, with 
possible loss of data and precision (pandas doesn't have proper strict type 
casting), when a more efficient path is available?
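
   To make the precision concern concrete: by default, pandas stores an integer 
column that contains nulls as float64, and a float64 cannot represent every 
64-bit integer. Below is a minimal stdlib-only sketch that mimics this by 
round-tripping values through a double. The helper name `via_float64` is 
hypothetical, and no Spark, Arrow, or pandas API is called:

```python
import struct


def via_float64(value):
    """Round-trip an int through an IEEE 754 double, the storage used by a float64 column."""
    packed = struct.pack("<d", float(value))
    return int(struct.unpack("<d", packed)[0])


small = 12345
big = 2**53 + 1  # smallest positive integer a float64 cannot represent exactly

print(via_float64(small) == small)  # small values survive the round trip
print(via_float64(big) == big)      # large values are silently rounded
```

   An Arrow-native path keeps such a column as a nullable int64, so this 
rounding does not occur there.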
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

