HyukjinKwon commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1686008186

   > What's the design philosophy behind having to go through pandas first with 
possible loss of data and precision, since pandas don't have proper strict type 
casting, while there is a more efficient path available?
   
   Arrow was initially considered an internal format, and that's the whole 
reason pandas came up first. In fact, the number of pandas users is (much) 
higher according to the stats I've been given, and pandas is informally 
considered the standard, TBH. It's too late to deprecate/remove the pandas API 
and switch the standard to Arrow in any event.
   
   > Going through Pandas requires users to install Pandas though they are 
using a different Arrow-based dataset API.
   
   It will require both pandas and Arrow to be installed on the driver side. 
Maybe pandas alone can be missing on executors, but I think that's a minor 
point (since nobody has actually complained on the mailing list or in JIRA as 
far as I can see).
   
   > That would require the user to implement what GroupedIterator in Spark 
does.
   
   Nah, the function itself can perform a groupby operation, e.g., 
`pandas.DataFrame.groupby().apply`, and it will do the same thing.
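   To illustrate the point, here is a minimal plain-pandas sketch of what a user function could do internally instead of relying on Spark's `GroupedIterator`: group the batch itself and apply a per-group function. The `summarize` function, column names, and data are hypothetical, not from the PR.

   ```python
   import pandas as pd

   # Hypothetical per-group function: collapse each group to one summary row.
   def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
       return pd.DataFrame(
           {"key": [pdf["key"].iloc[0]], "total": [pdf["value"].sum()]}
       )

   pdf = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 5]})

   # Grouping inside the function body yields the same per-group results
   # as having the framework pre-group the rows for you.
   result = (
       pdf.groupby("key", group_keys=False)
          .apply(summarize)
          .reset_index(drop=True)
   )
   print(result)
   ```

   The same pattern applies inside an Arrow- or pandas-based UDF body: the batch handed to the function can be regrouped there, so a dedicated grouped iterator on the framework side is not strictly required.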
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to