Kimahriman commented on PR #52303: URL: https://github.com/apache/spark/pull/52303#issuecomment-3327772187
> @Kimahriman Does your PR propose to "change" the current UDF signature from Pandas DataFrame (Arrow RecordBatch) to Iterator of Pandas DataFrame (Arrow RecordBatch), like we do in applyInPandasWithState and transformWithStateInPandas?
>
> While I agree this is a right direction to deal with OOM issue, this significantly regresses the UX. applyInPandasWithState and transformWithStateInPandas have to deal with state anyway to extend the accumulation across batches, while other stateless operators don't need to do that. It might be even confusing to support both Pandas DataFrame vs Iterator[Pandas DataFrame] in the same operation.

Yes and no: it just adds this as a new option. I already have a reworked version of my PR that adds a new eval type based on type hints instead of the way I am currently doing it. So it basically works like normal Pandas UDFs, which can take series/dataframes directly or take an iterator of series/dataframes. And it should be compatible with the changes to the JVM serialization side in this PR.

It's still one function call per group, so I don't think it's very confusing.
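The two signatures under discussion can be sketched in plain Python. This is a minimal illustration, not the actual PR code: the function names are hypothetical, and the type-hint check only mimics how an eval type could be selected from the UDF's annotations, analogous to how existing iterator-style Pandas UDFs are distinguished.

```python
import collections.abc
import typing
from typing import Iterator

import pandas as pd

# Existing style: the whole group arrives as a single pandas DataFrame,
# which must fit in memory at once.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Proposed style (per this discussion): the group arrives as an iterator
# of DataFrame chunks, so a large group never has to be fully materialized.
# Still conceptually one function call per group.
def sum_group(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    total = 0.0
    for pdf in chunks:
        total += pdf["v"].sum()
    yield pd.DataFrame({"v_sum": [total]})

def takes_iterator(fn) -> bool:
    """Inspect the first parameter annotation to decide which eval
    type to use -- a toy stand-in for type-hint-based dispatch."""
    hints = typing.get_type_hints(fn)
    first = next(iter(hints.values()))
    return typing.get_origin(first) is collections.abc.Iterator
```

With dispatch like this, both signatures can coexist in the same operation, and the runtime picks the serialization path from the annotation alone.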
