Re: [PR] [SPARK-46638][Python] Create Python UDTF API to acquire execution memory for function evaluation [spark]

via GitHub Mon, 15 Jan 2024 07:52:03 -0800


benhurdelhey commented on code in PR #44678:
URL: https://github.com/apache/spark/pull/44678#discussion_r1452538115



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvalPythonUDTFExec.scala:
##########
@@ -137,4 +150,46 @@ trait EvalPythonUDTFExec extends UnaryExecNode {
       }
     }
   }
+
+  lazy val memoryConsumer: Option[PythonUDTFMemoryConsumer] = {
+    if (TaskContext.get() != null) {
+      Some(PythonUDTFMemoryConsumer(udtf))
+    } else {
+      None
+    }
+  }
+}
+
+/**
+ * This class takes responsibility to allocate execution memory for UDTF 
evaluation before it begins
+ * and free the memory after the evaluation is over.
+ *
+ * Background: If the UDTF's 'analyze' method returns an 'AnalyzeResult' with 
a non-empty
+ * 'acquireExecutionMemoryMb' value, this value represents the amount of 
memory in megabytes that
+ * the UDTF should request from each Spark executor that it runs on. Then the 
UDTF takes
+ * responsibility to use at most this much memory, including all allocated 
objects. The purpose of
+ * this functionality is to prevent executors from crashing by running out of 
memory due to the
+ * extra memory consumption invoked by the UDTF's 'eval' and 'terminate' and 
'cleanup' methods.
+ *
+ * In this class, Spark calls 'TaskMemoryManager.acquireExecutionMemory' with 
the requested number
+ * of megabytes, and when Spark calls __init__  of the UDTF later, it updates 
the
+ * acquiredExecutionMemory integer passed into the UDTF constructor to the 
actual number returned
+ * from 'TaskMemoryManager.acquireExecutionMemory', so the 'eval' and 
'terminate' and 'cleanup'
+ * methods know it and can ensure to bound memory usage to at most this number.
+ */
+case class PythonUDTFMemoryConsumer(udtf: PythonUDTF)
+  extends MemoryConsumer(TaskContext.get().taskMemoryManager(), 
MemoryMode.ON_HEAP) {
+  private val BYTES_PER_MEGABYTE = 1024 * 1024

Review Comment:
   let's decide for either megabyte (then use 1000 * 1000 here) or call it 
mebibyte and change it to MiB everywhere :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-46638][Python] Create Python UDTF API to acquire execution memory for function evaluation [spark]

Reply via email to