viirya commented on a change in pull request #29744:
URL: https://github.com/apache/spark/pull/29744#discussion_r487582849
##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -816,6 +816,10 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
       } catch (SparkOutOfMemoryError oom) {
         canGrowArray = false;
       }
+    } else if (numKeys >= growthThreshold && longArray.size() / 2 >= MAX_CAPACITY) {
+      // The map cannot grow because doing so would cause it to exceed MAX_CAPACITY. Instead, we
+      // prevent the map from accepting any more new elements.
+      canGrowArray = false;
Review comment:
I think we won't exceed `MAX_CAPACITY` at all; we have restrictions in place to prevent that.
The root cause, I think, is that once the map reaches `MAX_CAPACITY`, we no longer grow it even when `numKeys` exceeds `growthThreshold`. So we can keep adding elements until the map is filled beyond its load factor (0.5), at which point the map's array can no longer be reused for sorting in `UnsafeKVExternalSorter`. Although `UnsafeKVExternalSorter` will allocate a new array in that case, the allocation can hit OOM and fail the query.
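
To make the failure mode concrete, here is a rough standalone model of the capacity arithmetic (my own simplified sketch, not the actual `BytesToBytesMap` code; `MAX_CAPACITY = 1 << 29` and the 0.5 load factor match the constants discussed here, and `GrowthCheckSketch` is just an illustrative name):

```java
// Simplified model of the growth check; not the real BytesToBytesMap code.
public class GrowthCheckSketch {
  static final long MAX_CAPACITY = 1L << 29;
  static final double LOAD_FACTOR = 0.5;

  public static void main(String[] args) {
    long capacity = MAX_CAPACITY;                // map already at max size
    long longArraySize = capacity * 2;           // longArray holds two longs per slot
    long growthThreshold = (long) (capacity * LOAD_FACTOR);

    long numKeys = growthThreshold + 1;          // one append past the threshold

    // Pre-existing behavior: growth is skipped at MAX_CAPACITY, yet appends
    // still succeed, so the load factor climbs past 0.5 and the array can no
    // longer be reused in-place for sorting by UnsafeKVExternalSorter.
    boolean canGrow = longArraySize / 2 < MAX_CAPACITY;   // false
    boolean overThreshold = numKeys >= growthThreshold;   // true
    System.out.println("over threshold: " + overThreshold + ", can grow: " + canGrow);
  }
}
```
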
This change favors reusing the array for sorting, but it stops the map from accepting new elements earlier: once the capacity reaches `MAX_CAPACITY`, half of the array goes unused. On the other hand, it also makes spilling kick in earlier.
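
Quick back-of-the-envelope for that tradeoff (same assumptions as the sketch above; the class name is illustrative):

```java
// Rough arithmetic for the earlier-spill tradeoff; illustrative only.
public class SpillTradeoffSketch {
  public static void main(String[] args) {
    long capacity = 1L << 29;            // MAX_CAPACITY key slots
    long growthThreshold = capacity / 2; // 0.5 load factor
    // With the new guard, the map stops accepting keys at growthThreshold,
    // so the spill is triggered while half of the key slots are still empty.
    System.out.println("slots unused at spill: " + (capacity - growthThreshold));
  }
}
```
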
When I was fixing a customer application earlier, we didn't see the OOM even when the payload reached `MAX_CAPACITY`. Could it be due to the memory settings used with your payload?