HyukjinKwon commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1687611468
Once we want them to be fully and properly supported as a user-facing API, we should probably think about Arrow versions of https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs and https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-function-apis. That's what I meant by too many APIs.

Yes, we marked it as a developer API, and `mapInArrow` can cover most direct Arrow use cases. That's why I suggest using it as a workaround, such as `repartition(grouping_cols).mapInArrow`. The only exception is cogrouping, which `mapInArrow` can't cover. In that case, you can use the somewhat slower `mapInPandas`. And this, and only this, is the benefit Arrow versions would bring.

To be extra clear:
- Once we add one, we should also think about all the other variants. From what I understood, your argument applies to all of the variants.
- The only benefit of adding this API is direct Arrow usage with cogrouping (which you can work around with `mapInPandas`).
