dtenedor commented on code in PR #44678:
URL: https://github.com/apache/spark/pull/44678#discussion_r1450999974
##########
python/pyspark/sql/udtf.py:
##########
@@ -133,12 +133,28 @@ class AnalyzeResult:
If non-empty, this is a sequence of expressions that the UDTF is specifying for Catalyst to
sort the input TABLE argument by. Note that the 'partitionBy' list must also be non-empty
in this case.
+ acquireExecutionMemoryMbRequested: long
+ If this is not None, this represents the amount of memory in megabytes that the UDTF should
+ request from each Spark executor that it runs on. Then the UDTF takes responsibility to use
+ at most this much memory, including all allocated objects. The purpose of this functionality
+ is to prevent executors from crashing by running out of memory due to the extra memory
+ consumption invoked by the UDTF's 'eval' and 'terminate' and 'cleanup' methods. Spark will
+ then call 'TaskMemoryManager.acquireExecutionMemory' with the requested number of megabytes.
+ acquireExecutionMemoryMbActual: long
+ If there is a task context available, Spark will assign this field to the number of
+ megabytes returned from the call to the TaskMemoryManager.acquireExecutionMemory' method, as
+ consumed by the UDTF's'__init__' method. Therefore, its 'eval' and 'terminate' and 'cleanup'
+ methods will know it thereafter and can ensure to bound memory usage to at most this number.
+ Note that there is no effect if the UDTF's 'analyze' method assigns a value to this; it will
+ be overwritten.
"""
Review Comment:
The `TaskMemoryManager.acquireExecutionMemory` API is a memory reservation
system. The idea is that an operator that will consume memory in the future
should call this method beforehand with its peak expected memory usage
on the executor. This constitutes a memory reservation, and the sum of all
such reservations may not reach or exceed the total memory available on the
executor (less some fixed overhead for the executor's own data structures).
For the case of Python UDTFs, the function should set
`AnalyzeResult.acquireExecutionMemoryMbRequested` to the maximum expected memory
use of the Python process, including memory allocated by the function itself
as well as by any imported libraries.
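
To make the reservation idea concrete, here is a minimal sketch that runs
without Spark. Everything in it is an assumption for illustration:
`AnalyzeResultSketch` is a hypothetical stand-in for `AnalyzeResult` that models
only the two fields from this diff, `ToyExecutorMemory` is a toy pool and not
the real `TaskMemoryManager`, and `BigModelUDTF` and its megabyte figures are
made up. The toy pool simply caps grants at the remaining budget, which is a
simplification of whatever policy Spark actually applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalyzeResultSketch:
    """Hypothetical stand-in for AnalyzeResult (only the two new fields)."""
    acquireExecutionMemoryMbRequested: Optional[int] = None
    acquireExecutionMemoryMbActual: Optional[int] = None


class ToyExecutorMemory:
    """Toy reservation pool; NOT Spark's TaskMemoryManager."""

    def __init__(self, total_mb: int, overhead_mb: int) -> None:
        # Budget = total executor memory less a fixed overhead.
        self.budget_mb = total_mb - overhead_mb
        self.reserved_mb = 0

    def acquire(self, requested_mb: int) -> int:
        # Grant no more than what is still unreserved (assumed policy).
        remaining = self.budget_mb - self.reserved_mb
        granted = max(0, min(requested_mb, remaining))
        self.reserved_mb += granted
        return granted


class BigModelUDTF:
    """Hypothetical UDTF that declares its peak memory use up front."""
    MODEL_MB = 300        # memory allocated by the function itself
    LIBRARY_MB = 150      # memory held by imported libraries
    WORKING_SET_MB = 50   # per-row scratch space in eval/terminate

    @staticmethod
    def analyze() -> AnalyzeResultSketch:
        # Request the peak for the whole Python process, not just eval().
        peak_mb = (BigModelUDTF.MODEL_MB
                   + BigModelUDTF.LIBRARY_MB
                   + BigModelUDTF.WORKING_SET_MB)
        return AnalyzeResultSketch(acquireExecutionMemoryMbRequested=peak_mb)


def reserve(pool: ToyExecutorMemory,
            result: AnalyzeResultSketch) -> AnalyzeResultSketch:
    # Mirrors the docstring: the "actual" field is overwritten with the grant.
    if result.acquireExecutionMemoryMbRequested is not None:
        result.acquireExecutionMemoryMbActual = pool.acquire(
            result.acquireExecutionMemoryMbRequested)
    return result


pool = ToyExecutorMemory(total_mb=1024, overhead_mb=224)
first = reserve(pool, BigModelUDTF.analyze())
second = reserve(pool, BigModelUDTF.analyze())
```

In this toy run the second reservation is granted less than it asked for, which
is the case where the UDTF's `eval`, `terminate`, and `cleanup` methods would
need to bound their memory use to the actual figure rather than the requested
one.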
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]