Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21102#discussion_r203322056
--- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
* to a new position (in the new data array).
*/
def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
- if (_size > _growThreshold) {
+ if (_occupied > _growThreshold) {
--- End diff ---
There is no explicit entry here - it is simply an unoccupied slot in the array.
The slot is free; it can be reused by some other (new) entry when insert is
called.
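For concreteness, here is a minimal sketch of what that reuse looks like in a
linear-probing table with tombstones (the `ProbeSketch` helper and its state
encoding are hypothetical, not OpenHashSet's actual internals):
```
object ProbeSketch {
  final val Empty: Byte = 0
  final val Live: Byte = 1
  final val Deleted: Byte = 2

  // Returns the slot an insert of `k` should claim: the key's existing slot
  // if present, else the first tombstoned slot seen on the probe path, else
  // the first empty slot. Reusing tombstones is what keeps freed slots from
  // accumulating.
  def insertPos(k: Int, keys: Array[Int], state: Array[Byte]): Int = {
    val mask = keys.length - 1   // capacity assumed to be a power of two
    var pos = k & mask           // identity "hash", for illustration only
    var firstDeleted = -1
    while (state(pos) != Empty) {
      if (state(pos) == Live && keys(pos) == k) return pos
      if (state(pos) == Deleted && firstDeleted < 0) firstDeleted = pos
      pos = (pos + 1) & mask
    }
    if (firstDeleted >= 0) firstDeleted else pos
  }
}
```
An insert that claims `firstDeleted` overwrites the tombstone in place, so the
occupied count does not grow.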
It is trivial to see how very bad behavior can happen even while the actual
size of the set stays very small - a series of adds and removes results in
unending growth of the set.
Something like this, for example, is enough to cause the set to blow up to 2B
entries:
```
var i = 0
while (i < Int.MaxValue) {
  set.add(1)
  set.remove(1)
  // live size is back to 0 every iteration, yet each add/remove pair
  // leaves another dead slot behind in the backing array
  assert(0 == set.size)
  i += 1
}
```
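To make the failure mode concrete, here is a toy model of that pathology
(a hypothetical `ToySet`, not Spark's OpenHashSet: it assumes inserts skip
tombstones rather than reuse them, and that the grow check is driven by the
occupied count). Under those assumptions the loop above keeps doubling the
backing array while the live size stays 0:
```
class ToySet(var capacity: Int = 8) {
  private val Empty: Byte = 0
  private val Live: Byte = 1
  private val Deleted: Byte = 2

  private var keys = new Array[Int](capacity)
  private var state = new Array[Byte](capacity)
  private var _size = 0     // live entries
  private var occupied = 0  // live + tombstoned slots; never drops on remove

  def size: Int = _size

  def add(k: Int): Unit = {
    var pos = k & (capacity - 1)
    // Buggy probe: walks past tombstones instead of reusing them, so every
    // add after a remove claims a brand-new empty slot.
    while (state(pos) != Empty && !(state(pos) == Live && keys(pos) == k)) {
      pos = (pos + 1) & (capacity - 1)
    }
    if (state(pos) == Empty) {
      keys(pos) = k
      state(pos) = Live
      _size += 1
      occupied += 1
      if (occupied > (capacity * 0.7).toInt) grow()
    }
  }

  def remove(k: Int): Unit = {
    var pos = k & (capacity - 1)
    while (state(pos) != Empty) {
      if (state(pos) == Live && keys(pos) == k) {
        state(pos) = Deleted  // slot is dead but still counts as occupied
        _size -= 1
        return
      }
      pos = (pos + 1) & (capacity - 1)
    }
  }

  // Rehash into a doubled array. Tombstones are dropped, so occupied falls
  // back to the live size, but the capacity doubling itself is permanent.
  private def grow(): Unit = {
    val oldKeys = keys
    val oldState = state
    capacity *= 2
    keys = new Array[Int](capacity)
    state = new Array[Byte](capacity)
    _size = 0
    occupied = 0
    var i = 0
    while (i < oldKeys.length) {
      if (oldState(i) == Live) add(oldKeys(i))
      i += 1
    }
  }
}

val set = new ToySet()
var i = 0
while (i < 100000) {  // far fewer iterations than Int.MaxValue
  set.add(1)
  set.remove(1)
  i += 1
}
println((set.size, set.capacity))  // (0, 131072): huge table, no live entries
```
Each add/remove pair consumes one fresh empty slot, so `occupied` crosses the
grow threshold over and over and the capacity doubles each time, even though
`size` never exceeds 1.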
---