Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6763#discussion_r32493399
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
    @@ -223,6 +224,8 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
        */
       private def rehash(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
         val newCapacity = _capacity * 2
    +    require(newCapacity <= OpenHashSet.MAX_CAPACITY,
    --- End diff --
    
    I agree this is the theoretical maximum number of elements the set can hold. The failure will actually occur any time twice the grow threshold exceeds `MAX_CAPACITY`, which can happen while the collection holds fewer elements than that maximum. So I'm actually not sure which wording is clearer here. Up to you.
    
    I think we still have a little problem here: once the capacity reaches 2^30, doubling it overflows to a negative `Int`, so `newCapacity <= OpenHashSet.MAX_CAPACITY` is still true and the check passes even though the new capacity is invalid. Check whether the existing capacity is `<= OpenHashSet.MAX_CAPACITY / 2` before doubling?
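
    For illustration, here is a minimal standalone sketch (not the actual `OpenHashSet` code) of how the doubled capacity slips past the proposed check; `MAX_CAPACITY` here stands in for `OpenHashSet.MAX_CAPACITY` and is assumed to be `1 << 30`:

    ```scala
    // Sketch only; MAX_CAPACITY is assumed to mirror OpenHashSet.MAX_CAPACITY (1 << 30).
    object CapacityOverflowSketch {
      val MAX_CAPACITY: Int = 1 << 30

      def main(args: Array[String]): Unit = {
        val capacity = 1 << 30          // capacity is already at the maximum
        val newCapacity = capacity * 2  // Int overflow: wraps to -2147483648

        // The proposed require would pass, because the overflowed value is negative:
        println(newCapacity <= MAX_CAPACITY)   // true, yet newCapacity is unusable

        // Checking the current capacity before doubling catches this case:
        println(capacity <= MAX_CAPACITY / 2)  // false, so the rehash would be rejected
      }
    }
    ```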


