ion-elgreco commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1685790842
> `mapInArrow` is marked as a developer API, and my initial intention was to avoid adding the arrow version of that everywhere - in theory `mapInArrow` can cover all the cases except cogrouping (I admit I missed this case when we added `mapInArrow`). That's why I am hesitant to add this now.
>
> What about pandas UDF and iteration friends? I think it's too much to add all Arrow versions to address a couple of corner cases and performance:
>
> - The performance benefit is not even super critical, since we already copy Arrow here and there. In fact, vanilla Arrow itself copies when it's converted to something else by default.
> - repartition-mapInArrow will cover the grouping case. The only leftover case is cogrouping, which has a bit of overhead.

What's the design philosophy behind having to go through pandas first, with possible loss of data and precision (since pandas doesn't have proper strict type casting), while a more efficient path is available?
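
To make the precision concern concrete, here is a minimal sketch in plain Python. It does not call pandas or Arrow. It only mimics, under the assumption that a nullable 64-bit integer column is upcast to float64 when it contains missing values (the default pandas behavior), what happens to large integers on that path. The helper `to_float64_column` is hypothetical and exists only for this illustration.

```python
import math


def to_float64_column(values):
    """Hypothetical helper: mimic upcasting a nullable int column to float64.

    Missing values (None) become NaN, everything else becomes a Python float,
    which is an IEEE 754 double just like a float64.
    """
    return [math.nan if v is None else float(v) for v in values]


# 2**53 + 1 is the smallest positive integer a float64 cannot represent exactly.
big = 2**53 + 1
column = [1, big, None]

as_float = to_float64_column(column)

# Convert back to integers, the way a consumer expecting int64 would.
round_tripped = [None if math.isnan(v) else int(v) for v in as_float]

# Collect every value that silently changed during the round trip.
changed = [(a, b) for a, b in zip(column, round_tripped) if a != b]
print(changed)
```

An Arrow `int64` column with nulls keeps such values exact because validity is tracked separately from the data, which is the kind of loss the question above is pointing at.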
