itholic commented on PR #40525: URL: https://github.com/apache/spark/pull/40525#issuecomment-1501047078
Thank you for the feedback, @bjornjorgensen! IMHO, it seems more reasonable to add `grpcio` as a dependency for the pandas API on Spark than to revert all of these changes (oh, it looks like you already opened https://github.com/apache/spark/pull/40716 for this? 😄)

The purpose of Spark Connect is to let users run their existing PySpark projects through a remote client without any code changes. Therefore, if a user's existing code uses the `pyspark.pandas` module, it should work the same way through the remote client as well. I think we should support as much of PySpark's functionality as possible, including the pandas API on Spark, since at this point nobody can be sure whether existing PySpark users, and not only existing pandas users, will use the pandas API on Spark through Spark Connect.

Alternatively, we could create a completely separate package path for the pandas API on Spark under Spark Connect. That would allow the existing pandas API on Spark to be used without installing `grpcio`, but it would add much more overhead than simply changing the policy to add one package as an additional installation.

WDYT? Also cc @HyukjinKwon @grundprinzip @ueshin @zhengruifeng FYI

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
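As context for the dependency question above, a minimal sketch of how an optional dependency like `grpcio` can be probed at import time, in the spirit of how optional-dependency guards usually work (this is an illustrative sketch, not code from the PR; note that the `grpcio` package installs itself under the module name `grpc`):

```python
import importlib.util


def spark_connect_deps_available() -> bool:
    """Return True if the grpc module (installed by the grpcio package)
    is importable, i.e. Spark Connect's gRPC dependency is present."""
    # find_spec returns None when the module cannot be located,
    # without actually importing it.
    return importlib.util.find_spec("grpc") is not None


if __name__ == "__main__":
    print(spark_connect_deps_available())
```

A guard like this is what lets a library raise a clear "please install grpcio" error instead of a bare ImportError when the remote-client path is exercised without the extra package.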
