HyukjinKwon edited a comment on pull request #26783: URL: https://github.com/apache/spark/pull/26783#issuecomment-951504965
> For the second one, I guess there might be more requirements than a map-style API?

Yeah, I worry about this too. I thought that people would at least be able to do it though (given that the Python RDD APIs are built on top of a single `RDD.mapPartitions`).

To naturally support all cases we should probably make it a UDF, but I was hesitant about adding it as `arrow_udf` because we would have to take care of other restrictions, and of variants like aggregation, window, etc., all together, and I thought that might not be worthwhile. I was also initially skeptical about this API because I thought of Arrow as an internal format rather than a user-facing one.

So this made me propose one (developer) API that doesn't require considering other restrictions (e.g., the length of the output must be the same as the input's in the case of a scalar UDF in `select`) or variants. I tend to think it might be worthwhile to have this one generalized version, given that it has been requested several times and the reasoning seems to make sense, but I still don't have a very strong opinion. I am checking w/ other people here :-).
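To make the length restriction mentioned above concrete, here is a minimal pure-Python sketch contrasting the two contracts. It is not Spark code: plain lists stand in for Arrow batches, and `apply_scalar_style`, `apply_map_style`, `double`, and `keep_even` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for an Arrow record batch: a plain list of ints.
Batch = List[int]


def apply_scalar_style(func: Callable[[Batch], Batch],
                       batches: Iterable[Batch]) -> List[Batch]:
    """Scalar-UDF-like contract: each output batch must have the
    same length as its input batch."""
    out = []
    for batch in batches:
        result = func(batch)
        if len(result) != len(batch):
            raise ValueError(
                f"expected {len(batch)} values, got {len(result)}"
            )
        out.append(result)
    return out


def apply_map_style(func: Callable[[Iterator[Batch]], Iterator[Batch]],
                    batches: Iterable[Batch]) -> List[Batch]:
    """Map-style contract: func receives an iterator over the batches
    of a partition and may yield any number of batches of any length."""
    return list(func(iter(batches)))


def double(batch: Batch) -> Batch:
    # Length-preserving, so it satisfies the scalar-style contract.
    return [x * 2 for x in batch]


def keep_even(batches: Iterator[Batch]) -> Iterator[Batch]:
    # Drops rows and skips empty batches; only valid under the map-style contract.
    for batch in batches:
        kept = [x for x in batch if x % 2 == 0]
        if kept:
            yield kept


partition = [[1, 2, 3], [4, 5]]
scalar_result = apply_scalar_style(double, partition)
map_result = apply_map_style(keep_even, partition)
```

In this sketch a filter such as `keep_even` cannot be expressed under the scalar-style contract, because it changes the number of rows, while the map-style function accepts it without any extra rules.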
