Kimahriman commented on PR #52303: URL: https://github.com/apache/spark/pull/52303#issuecomment-3327772187
> @Kimahriman Does your PR propose to "change" the current UDF signature from Pandas DataFrame (Arrow RecordBatch) to Iterator of Pandas DataFrame (Arrow RecordBatch), like we do in applyInPandasWithState and transformWithStateInPandas?
>
> While I agree this is a right direction to deal with OOM issue, this significantly regresses the UX. applyInPandasWithState and transformWithStateInPandas have to deal with state anyway to extend the accumulation across batches, while other stateless operators don't need to do that. It might be even confusing to support both Pandas DataFrame vs Iterator[Pandas DataFrame] in the same operation.

Yes and no: it just adds this as a new option. I already have a reworked version of my PR that adds a new eval type based on type hints instead of the way I am currently doing it. So it basically works like normal Pandas UDFs, which can take series/dataframes directly or take an iterator of series/dataframes. And it should be compatible with the changes to the JVM serialization side in this PR.

It's still one function call per group, so I don't think it's very confusing.
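The two signatures under discussion can be sketched in plain Python. This is a minimal illustration, not the actual PR code: the function names are hypothetical, and the type-hint check only mimics how an eval type could be selected from the UDF's annotations, analogous to how existing iterator-style Pandas UDFs are distinguished.

```python
import collections.abc
import typing
from typing import Iterator

import pandas as pd

# Existing style: the whole group arrives as a single pandas DataFrame,
# which must fit in memory at once.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Proposed style (per this discussion): the group arrives as an iterator
# of DataFrame chunks, so a large group never has to be fully materialized.
# Still conceptually one function call per group.
def sum_group(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    total = 0.0
    for pdf in chunks:
        total += pdf["v"].sum()
    yield pd.DataFrame({"v_sum": [total]})

def takes_iterator(fn) -> bool:
    """Inspect the first parameter annotation to decide which eval
    type to use -- a toy stand-in for type-hint-based dispatch."""
    hints = typing.get_type_hints(fn)
    first = next(iter(hints.values()))
    return typing.get_origin(first) is collections.abc.Iterator
```

With dispatch like this, both signatures can coexist in the same operation, and the runtime picks the serialization path from the annotation alone.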
