goodwanghan commented on PR #38624: URL: https://github.com/apache/spark/pull/38624#issuecomment-1685705080
> Thanks for the clarification. Repartitioning plus a pre-sort should indeed work; technically this is very doable, and the performance should also be decent. But I think this is an essential programming interface that official PySpark should have (and given that `mapInArrow` already exists, `applyInArrow` seems like a natural expectation from users). It also matters because the semantics are entirely independent of pandas. The underlying implementation of pandas UDFs is all Arrow-based anyway, so I'd argue it isn't even necessary to call them pandas UDFs (I don't expect the names to change, just sharing my opinion).
