viirya commented on a change in pull request #26828: [SPARK-30198][Core] 
BytesToBytesMap does not grow internal long array as expected
URL: https://github.com/apache/spark/pull/26828#discussion_r356721936
 
 

 ##########
 File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
 ##########
 @@ -741,7 +741,9 @@ public boolean append(Object kbase, long koff, int klen, 
Object vbase, long voff
         longArray.set(pos * 2 + 1, keyHashcode);
         isDefined = true;
 
-        if (numKeys >= growthThreshold && longArray.size() < MAX_CAPACITY) {
 
 Review comment:
   Ah, sure.
   
   The client of `BytesToBytesMap`, like HashAggregate, will call `lookup` to 
find a `Location` to write value. The returned location will be used to do 
append (`Location.append`).  Everytime after we append a key/value, we check if 
it is time to grow internal array and grow up it if needed.
   
   `lookup` delegates looking up keys to `safeLookup`. Its control flow looks 
like:
   
   ```java
   int pos = hash & mask;
   int step = 1;
   // an infinite loop until find matching key or empty slot.
   while (true) {
     if (empty slot found) {
       ...
     } else {
       // check if matching key
       ...
     }
     // increase pos and step
     pos = (pos + step) & mask;
     step++;
   }
   ```
   
   So the job hangs in this loop because it can not find any empty location as 
the internal array is full.
   
   We early stop growing the internal array due to wrongly check array size at:
   
   ```java
   if (numKeys >= growthThreshold && longArray.size() < MAX_CAPACITY) {
     ...
   }
   ```
   
   Another point 
https://github.com/apache/spark/pull/26828#discussion_r356370350 is we may want 
to set `canGrowArray` to false once we are close to max capacity, so we can 
avoid infinite loop again.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to