timarmstrong commented on a change in pull request #34186:
URL: https://github.com/apache/spark/pull/34186#discussion_r728282858



##########
File path: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
##########
@@ -149,95 +149,100 @@ public long acquireExecutionMemory(long required, 
MemoryConsumer consumer) {
       // Try to release memory from other consumers first, then we can reduce 
the frequency of
       // spilling, avoid to have too many spilled files.
       if (got < required) {
-        // Call spill() on other consumers to release memory
-        // Sort the consumers according their memory usage. So we avoid 
spilling the same consumer
-        // which is just spilled in last few times and re-spilling on it will 
produce many small
-        // spill files.
+        logger.debug("Task {} need to spill {} for {}", taskAttemptId,
+                Utils.bytesToString(required - got), requestingConsumer);
+        // We need to call spill() on consumers to free up more memory. We 
want to optimize for two
+        // things:
+        // * Minimize the number of spill calls, to reduce the number of spill 
files and avoid small
+        //  spill files.
+        // * Avoid spilling more data than necessary - if we only need a 
little more memory, we may
+        //   not want to spill as much data as possible. Many consumers spill 
more than the
+        //   requested amount, so we can take that into account in our 
decisions.
+        // We use a heuristic that selects the smallest memory consumer with 
at least `required`
+        // bytes of memory in an attempt to balance these factors. It may work 
well if there are
+        // fewer larger requests, but can result in many small spills if there 
are many smaller
+        // requests.
+
+        // Build a map of consumer in order of memory usage to prioritize 
spilling. Assign current
+        // consumer a nominal memory usage of 0 so that it is always last in 
priority order.
         TreeMap<Long, List<MemoryConsumer>> sortedConsumers = new TreeMap<>();
         for (MemoryConsumer c: consumers) {

Review comment:
       You are right - it will not be in there if this is the first time it has 
called `acquireExecutionMemory()`, but in that case we don't need to consider 
spilling it. I'll extend the comment to clarify.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to