ankurdave commented on a change in pull request #29744:
URL: https://github.com/apache/spark/pull/29744#discussion_r487584414



##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -816,6 +816,10 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
           } catch (SparkOutOfMemoryError oom) {
             canGrowArray = false;
           }
+        } else if (numKeys >= growthThreshold && longArray.size() / 2 >= MAX_CAPACITY) {
+          // The map cannot grow because doing so would cause it to exceed MAX_CAPACITY. Instead, we
+          // prevent the map from accepting any more new elements.
+          canGrowArray = false;

Review comment:
       I agree. Once we reach `MAX_CAPACITY`, we have a choice: we can either (a) exceed the load factor or (b) spill.

   The code currently does (a). If the number of groups ultimately turns out to be less than `MAX_CAPACITY`, this may avoid spilling. (Performance will degrade as the load factor approaches 1.0, but it will still be better than spilling.) However, if the number of groups is larger than `MAX_CAPACITY`, or if we need more memory to store the records, then we will be unable to spill and the query will fail.

   It seems to me that it would be better to prevent queries from failing, as (b) does, even though it means we would spill earlier for a narrow range of queries. A rough sketch of what (b) could look like is below.
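
   To make option (b) concrete, here is a minimal, self-contained sketch. It is not the actual `BytesToBytesMap` code: apart from the `MAX_CAPACITY` / `growthThreshold` idea taken from the diff above, the class, fields, and helper names are hypothetical, and it assumes the caller treats a `false` return from `append` as "spill and retry".

   ```java
   // Toy sketch only -- not the real BytesToBytesMap. Names other than
   // MAX_CAPACITY / growthThreshold (which echo the diff) are made up.
   public class AppendPolicySketch {
     // Assumed cap on the number of slots, standing in for MAX_CAPACITY.
     private static final int MAX_CAPACITY = 1 << 29;

     // Each slot uses two long entries (pointer + hash), so the slot count is
     // length / 2, mirroring the `longArray.size() / 2` check in the diff.
     private long[] longArray = new long[1 << 12];
     private int numKeys = 0;
     private int growthThreshold = (int) ((longArray.length / 2) * 0.5);

     /**
      * Option (b): return false when the map is full and cannot grow, so the
      * caller spills to disk and retries, instead of letting the load factor
      * creep toward 1.0 and risking a later failure.
      */
     public boolean append(long key) {
       if (numKeys >= growthThreshold) {
         if (longArray.length / 2 < MAX_CAPACITY) {
           growAndRehash();
         } else {
           // Cannot grow past MAX_CAPACITY: signal the caller to spill.
           return false;
         }
       }
       // ... probe for a free slot and write the key/value pointers here ...
       numKeys++;
       return true;
     }

     private void growAndRehash() {
       long[] newArray = new long[longArray.length * 2];
       // ... re-insert existing entries into newArray ...
       longArray = newArray;
       growthThreshold = (int) ((longArray.length / 2) * 0.5);
     }
   }
   ```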



