HyukjinKwon commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1686008186
> What's the design philosophy behind having to go through pandas first, with possible loss of data and precision since pandas doesn't have proper strict type casting, while a more efficient path is available?

Arrow was initially considered an internal format, and that's the whole reason pandas came first. In fact, the number of pandas users is (much) higher given some stats I have been given, and pandas is informally considered the standard, TBH. It's too late to deprecate/remove the pandas API and switch the standard to Arrow in any event.

> Going through Pandas requires users to install Pandas though they are using a different Arrow-based dataset API.

It would require the driver side to have both pandas and Arrow installed. Pandas alone could perhaps be missing on the executors, but I think that's a minor point (since nobody actually complains on the mailing list or in JIRA as far as I can see).

> That would require the user to implement what GroupedIterator in Spark does.

Nah, the operation within the function can perform a groupby itself, e.g., `pandas.groupby().apply`, and then it will do the same.
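A minimal sketch of that last point (the toy data and column names are illustrative, not from the PR): a user function that receives a whole pandas DataFrame can do its own per-group processing with `pandas.DataFrame.groupby(...).apply(...)`, without Spark's `GroupedIterator` pre-splitting the rows for it.

```python
import pandas as pd

# Hypothetical toy data standing in for the rows a function would receive.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# The function groups by key itself, instead of relying on Spark-side grouping.
sums = df.groupby("key")["value"].apply(lambda s: s.sum())
# `sums` is a Series indexed by "key": a -> 3, b -> 3.
```

The same pattern works inside any function that gets un-grouped batches: group, apply, and return the result.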
