[GitHub] [spark] HyukjinKwon commented on pull request #38624: [SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup

via GitHub Tue, 22 Aug 2023 00:18:30 -0700


HyukjinKwon commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1687611468

If we want these to be properly supported as a user-facing API, we
should probably think about Arrow versions of
https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs
and
https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-function-apis.
That's what I meant by too many APIs.

Yes, we marked it as a developer API, and `mapInArrow` can cover most
direct Arrow use cases. That's why I suggest using it as a workaround, such
as `repartition(grouping_cols).mapInArrow`. The only exception is cogrouping,
which `mapInArrow` can't cover. In that case, you can use the somewhat slower
`mapInPandas`. Covering that case is the one and only benefit the Arrow
versions would bring.

To be extra clear:
- Once we add one, we should also think about all the other variants. From what
I understood, your argument applies to all of the variants.
- The only benefit of adding this API is direct Arrow usage with
cogrouping (which you can work around with `mapInPandas`).

