HyukjinKwon commented on PR #38624:
URL: https://github.com/apache/spark/pull/38624#issuecomment-1687611468

   Once we want them to be properly supported as a user-facing API, we 
should probably also think about Arrow versions of 
https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs
 and 
https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-function-apis.
 That's what I meant by too many APIs.
   
   Yes, we marked it as a developer API, and `mapInArrow` can cover most 
direct Arrow use cases. That's why I suggest using it as a workaround, such 
as `repartition(grouping_cols).mapInArrow`. The only exception is cogrouping, 
which `mapInArrow` can't cover. In that case, you can use the somewhat slower 
`mapInPandas`. Direct Arrow usage with cogrouping is the one and only benefit 
the Arrow versions would bring.
   
   To be extra clear,
   - Once we add one, we should also think about all the other variants. From 
what I understood, your argument applies to all of the variants.
   - The only benefit of adding this API is direct Arrow usage with 
cogrouping (which you can work around with `mapInPandas`).
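
   For reference, a hedged sketch of a pandas-based cogrouping fallback. The 
comment names `mapInPandas`; this sketch instead assumes the cogrouped pandas 
function API (`cogroup(...).applyInPandas`) from the pandas function APIs page 
linked above. The column names `id`, `v1`, `v2` and the helper names are 
hypothetical.

```python
import pandas as pd


def join_cogroup(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Called once per key with that key's rows from both sides."""
    # Data arrives as pandas DataFrames, so it is converted from Arrow first.
    return pd.merge(left, right, on="id", how="inner")


def cogroup_join(df1, df2):
    """Assumes pyspark.sql.DataFrames `df1(id, v1)` and `df2(id, v2)`, all long."""
    return (
        df1.groupby("id")
        .cogroup(df2.groupby("id"))
        .applyInPandas(join_cogroup, schema="id long, v1 long, v2 long")
    )
```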
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

