viirya commented on a change in pull request #26828: [SPARK-30198][Core]
BytesToBytesMap does not grow internal long array as expected
URL: https://github.com/apache/spark/pull/26828#discussion_r356721936
##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -741,7 +741,9 @@ public boolean append(Object kbase, long koff, int klen,
Object vbase, long voff
longArray.set(pos * 2 + 1, keyHashcode);
isDefined = true;
- if (numKeys >= growthThreshold && longArray.size() < MAX_CAPACITY) {
Review comment:
Ah, sure.
The client of `BytesToBytesMap`, like HashAggregate, will call `lookup` to
find a `Location` to write value. The returned location will be used to do
append (`Location.append`). Everytime after we append a key/value, we check if
it is time to grow internal array and grow up it if needed.
`lookup` delegates looking up keys to `safeLookup`. Its control flow looks
like:
```java
int pos = hash & mask;
int step = 1;
// an infinite loop until find matching key or empty slot.
while (true) {
if (empty slot found) {
...
} else {
// check if matching key
...
}
// increase pos and step
pos = (pos + step) & mask;
step++;
}
```
So the job hangs in this loop because it can not find any empty location as
the internal array is full.
We early stop growing the internal array due to wrongly check array size at:
```java
if (numKeys >= growthThreshold && longArray.size() < MAX_CAPACITY) {
...
}
```
Another point
https://github.com/apache/spark/pull/26828#discussion_r356370350 is we may want
to set `canGrowArray` to false once we are close to max capacity, so we can
avoid infinite loop again.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]