viirya commented on a change in pull request #26828: [SPARK-30198][Core] BytesToBytesMap does not grow internal long array as expected
URL: https://github.com/apache/spark/pull/26828#discussion_r356370350
##########
File path: core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
##########
@@ -741,7 +741,9 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff
       longArray.set(pos * 2 + 1, keyHashcode);
       isDefined = true;
-      if (numKeys >= growthThreshold && longArray.size() < MAX_CAPACITY) {
+      // We use two array entries per key, so the array size is twice the capacity.
+      // We should compare the current capacity of the array, instead of its size.
+      if (numKeys >= growthThreshold && longArray.size() / 2 < MAX_CAPACITY) {
         try {
           growAndRehash();
Review comment:
Actually, I also think we should set `canGrowArray` to false, like:
```java
if (numKeys >= growthThreshold && longArray.size() / 2 >= MAX_CAPACITY) {
  canGrowArray = false;
}
```
So once we reach the max capacity of the map, `canGrowArray` is set to false. The next append will then fail, letting the map spill and fall back to sort-based aggregation in HashAggregate. That way we prevent a similar infinite loop when the map reaches max capacity.
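
A minimal sketch of how the corrected growth check and the proposed `canGrowArray` flip could fit together inside `append()`; the `catch` clause and the exact placement are my assumption based on the surrounding code, not part of this diff:
```java
// Sketch only: longArray holds two entries per key, so its current key
// capacity is longArray.size() / 2.
if (numKeys >= growthThreshold) {
  if (longArray.size() / 2 < MAX_CAPACITY) {
    try {
      growAndRehash();
    } catch (SparkOutOfMemoryError oom) {
      // Could not allocate a bigger array; stop trying to grow.
      canGrowArray = false;
    }
  } else {
    // Already at max capacity: refuse further inserts so the caller spills
    // and HashAggregate falls back to sort-based aggregation.
    canGrowArray = false;
  }
}
```
Either way, the key point is that once `longArray.size() / 2` reaches `MAX_CAPACITY`, the map should stop reporting that it can grow.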
cc @cloud-fan @felixcheung