ankurdave commented on a change in pull request #29744:
URL: https://github.com/apache/spark/pull/29744#discussion_r487588391
##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -816,6 +816,10 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
       } catch (SparkOutOfMemoryError oom) {
         canGrowArray = false;
       }
+    } else if (numKeys >= growthThreshold && longArray.size() / 2 >= MAX_CAPACITY) {
+      // The map cannot grow because doing so would cause it to exceed MAX_CAPACITY. Instead, we
+      // prevent the map from accepting any more new elements.
+      canGrowArray = false;
Review comment:
Regarding the memory setting: I agree that whether an OOM occurs depends on
the memory setting, but I would argue that (1) the user should not be required
to tune memory settings just to avoid failing queries, and (2) many reasonable
settings will cause an OOM.
The problem arises when there are between 2^28+1 and 2^29 keys. If we need
to spill, `UnsafeKVExternalSorter` will attempt to allocate between 8 and 16
GiB. The user must therefore either increase the memory per task by
approximately that amount so the allocation succeeds, or decrease the memory
per task enough that the map never reaches `MAX_CAPACITY` in the first place.
That leaves quite a wide range of problematic settings in between.
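For concreteness, here is a back-of-envelope sketch of where the 8-16 GiB
figure comes from. It assumes the spill sorter needs roughly 32 bytes of
pointer-array space per record (2 longs for the record pointer and key prefix,
doubled for sort scratch space); the 32-bytes-per-record model and the class
below are illustrative assumptions, not the exact Spark internals:

```java
public class SpillSizeEstimate {
    // Hypothetical model: 2 longs (record pointer + key prefix) per record,
    // plus an equal-sized scratch region for the sort, i.e. 4 longs =
    // 32 bytes per record.
    static long pointerArrayBytes(long numRecords) {
        return numRecords * 4L * 8L;
    }

    public static void main(String[] args) {
        long giB = 1L << 30;
        // Just past 2^28 keys: a bit over 8 GiB.
        System.out.println(pointerArrayBytes((1L << 28) + 1) / (double) giB);
        // At 2^29 keys (MAX_CAPACITY): 16 GiB.
        System.out.println(pointerArrayBytes(1L << 29) / (double) giB);
    }
}
```

Under this model, any key count in (2^28, 2^29] forces a single allocation in
the 8-16 GiB range at spill time, which is the window of problematic settings
described above.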
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]