viirya commented on a change in pull request #29744:
URL: https://github.com/apache/spark/pull/29744#discussion_r487582849
##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -816,6 +816,10 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
       } catch (SparkOutOfMemoryError oom) {
         canGrowArray = false;
       }
+    } else if (numKeys >= growthThreshold && longArray.size() / 2 >= MAX_CAPACITY) {
+      // The map cannot grow because doing so would cause it to exceed MAX_CAPACITY. Instead, we
+      // prevent the map from accepting any more new elements.
+      canGrowArray = false;
Review comment:
I think we won't exceed `MAX_CAPACITY` at all; we have restrictions in place to prevent that.
The root cause, I think, is that once the map reaches `MAX_CAPACITY`, we no longer grow it even when `numKeys` exceeds `growthThreshold`. So we can keep adding elements until the map is filled beyond its load factor (0.5), at which point the map's array can no longer be reused for sorting in `UnsafeKVExternalSorter`. Although `UnsafeKVExternalSorter` will allocate a new array in that case, the allocation can hit OOM and fail the query.
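
To make the failure mode concrete, here is a rough standalone model of the capacity arithmetic (my own simplified sketch, not the actual `BytesToBytesMap` code; `MAX_CAPACITY = 1 << 29` and the 0.5 load factor match the constants discussed here, and `GrowthCheckSketch` is just an illustrative name):

```java
// Simplified model of the growth check; not the real BytesToBytesMap code.
public class GrowthCheckSketch {
  static final long MAX_CAPACITY = 1L << 29;
  static final double LOAD_FACTOR = 0.5;

  public static void main(String[] args) {
    long capacity = MAX_CAPACITY;                // map already at max size
    long longArraySize = capacity * 2;           // longArray holds two longs per slot
    long growthThreshold = (long) (capacity * LOAD_FACTOR);

    long numKeys = growthThreshold + 1;          // one append past the threshold

    // Pre-existing behavior: growth is skipped at MAX_CAPACITY, yet appends
    // still succeed, so the load factor climbs past 0.5 and the array can no
    // longer be reused in-place for sorting by UnsafeKVExternalSorter.
    boolean canGrow = longArraySize / 2 < MAX_CAPACITY;   // false
    boolean overThreshold = numKeys >= growthThreshold;   // true
    System.out.println("over threshold: " + overThreshold + ", can grow: " + canGrow);
  }
}
```
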
This change favors reusing the array for sorting, but it stops the map from accepting new elements earlier: once the capacity reaches `MAX_CAPACITY`, half of the array goes unused. On the other hand, it also makes spilling kick in earlier.
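
Quick back-of-the-envelope for that tradeoff (same assumptions as the sketch above; the class name is illustrative):

```java
// Rough arithmetic for the earlier-spill tradeoff; illustrative only.
public class SpillTradeoffSketch {
  public static void main(String[] args) {
    long capacity = 1L << 29;            // MAX_CAPACITY key slots
    long growthThreshold = capacity / 2; // 0.5 load factor
    // With the new guard, the map stops accepting keys at growthThreshold,
    // so the spill is triggered while half of the key slots are still empty.
    System.out.println("slots unused at spill: " + (capacity - growthThreshold));
  }
}
```
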
When I was fixing a customer application earlier, we didn't see the OOM even when the payload reached `MAX_CAPACITY`. Could it be due to the memory settings used with your payload?