ankurdave commented on a change in pull request #29744:
URL: https://github.com/apache/spark/pull/29744#discussion_r487588391
##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -816,6 +816,10 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
       } catch (SparkOutOfMemoryError oom) {
         canGrowArray = false;
       }
+    } else if (numKeys >= growthThreshold && longArray.size() / 2 >= MAX_CAPACITY) {
+      // The map cannot grow because doing so would cause it to exceed MAX_CAPACITY. Instead, we
+      // prevent the map from accepting any more new elements.
+      canGrowArray = false;
Review comment:
Regarding the memory setting: I agree that whether an OOM occurs depends on
the memory setting, but I would argue that (1) the user should not be required
to tune memory settings just to avoid failing queries, and (2) many reasonable
settings will cause an OOM.
The problem arises when there are between 2^28+1 and 2^29 keys. If we need
to spill, `UnsafeKVExternalSorter` will attempt to allocate between 8 and 16
GiB. The user must therefore either increase the memory per task by
approximately that amount so the allocation succeeds, or decrease the memory
per task enough that the map never reaches `MAX_CAPACITY` in the first place.
That leaves quite a wide range of problematic settings in between.
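For concreteness, here is a back-of-envelope sketch of where the 8-16 GiB
figure comes from. It assumes the spill sorter needs roughly 32 bytes of
pointer-array space per record (2 longs for the record pointer and key prefix,
doubled for sort scratch space); the 32-bytes-per-record model and the class
below are illustrative assumptions, not the exact Spark internals:

```java
public class SpillSizeEstimate {
    // Hypothetical model: 2 longs (record pointer + key prefix) per record,
    // plus an equal-sized scratch region for the sort, i.e. 4 longs =
    // 32 bytes per record.
    static long pointerArrayBytes(long numRecords) {
        return numRecords * 4L * 8L;
    }

    public static void main(String[] args) {
        long giB = 1L << 30;
        // Just past 2^28 keys: a bit over 8 GiB.
        System.out.println(pointerArrayBytes((1L << 28) + 1) / (double) giB);
        // At 2^29 keys (MAX_CAPACITY): 16 GiB.
        System.out.println(pointerArrayBytes(1L << 29) / (double) giB);
    }
}
```

Under this model, any key count in (2^28, 2^29] forces a single allocation in
the 8-16 GiB range at spill time, which is the window of problematic settings
described above.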
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]