Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/21102#discussion_r203322056
--- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -163,7 +187,7 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
* to a new position (in the new data array).
*/
def rehashIfNeeded(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
- if (_size > _growThreshold) {
+ if (_occupied > _growThreshold) {
--- End diff ---
There is no explicit entry here - it is simply an unoccupied slot in the array.
The slot is free; it can be reused by some other (new) entry when insert is
called.
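For concreteness, here is a minimal sketch of what that reuse looks like in a
linear-probing table with tombstones (the `ProbeSketch` helper and its state
encoding are hypothetical, not OpenHashSet's actual internals):
```
object ProbeSketch {
  final val Empty: Byte = 0
  final val Live: Byte = 1
  final val Deleted: Byte = 2

  // Returns the slot an insert of `k` should claim: the key's existing slot
  // if present, else the first tombstoned slot seen on the probe path, else
  // the first empty slot. Reusing tombstones is what keeps freed slots from
  // accumulating.
  def insertPos(k: Int, keys: Array[Int], state: Array[Byte]): Int = {
    val mask = keys.length - 1   // capacity assumed to be a power of two
    var pos = k & mask           // identity "hash", for illustration only
    var firstDeleted = -1
    while (state(pos) != Empty) {
      if (state(pos) == Live && keys(pos) == k) return pos
      if (state(pos) == Deleted && firstDeleted < 0) firstDeleted = pos
      pos = (pos + 1) & mask
    }
    if (firstDeleted >= 0) firstDeleted else pos
  }
}
```
An insert that claims `firstDeleted` overwrites the tombstone in place, so the
occupied count does not grow.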
It is trivial to see how very bad behavior can happen even while the actual
size of the set stays very small - a series of adds and removes results in
unending growth of the set.
Something like this, for example, is enough to cause the set to blow up to 2B
entries:
```
var i = 0
while (i < Int.MaxValue) {
  set.add(1)
  set.remove(1)
  // live size is back to 0 every iteration, yet each add/remove pair
  // leaves another dead slot behind in the backing array
  assert(0 == set.size)
  i += 1
}
```
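To make the failure mode concrete, here is a toy model of that pathology
(a hypothetical `ToySet`, not Spark's OpenHashSet: it assumes inserts skip
tombstones rather than reuse them, and that the grow check is driven by the
occupied count). Under those assumptions the loop above keeps doubling the
backing array while the live size stays 0:
```
class ToySet(var capacity: Int = 8) {
  private val Empty: Byte = 0
  private val Live: Byte = 1
  private val Deleted: Byte = 2

  private var keys = new Array[Int](capacity)
  private var state = new Array[Byte](capacity)
  private var _size = 0     // live entries
  private var occupied = 0  // live + tombstoned slots; never drops on remove

  def size: Int = _size

  def add(k: Int): Unit = {
    var pos = k & (capacity - 1)
    // Buggy probe: walks past tombstones instead of reusing them, so every
    // add after a remove claims a brand-new empty slot.
    while (state(pos) != Empty && !(state(pos) == Live && keys(pos) == k)) {
      pos = (pos + 1) & (capacity - 1)
    }
    if (state(pos) == Empty) {
      keys(pos) = k
      state(pos) = Live
      _size += 1
      occupied += 1
      if (occupied > (capacity * 0.7).toInt) grow()
    }
  }

  def remove(k: Int): Unit = {
    var pos = k & (capacity - 1)
    while (state(pos) != Empty) {
      if (state(pos) == Live && keys(pos) == k) {
        state(pos) = Deleted  // slot is dead but still counts as occupied
        _size -= 1
        return
      }
      pos = (pos + 1) & (capacity - 1)
    }
  }

  // Rehash into a doubled array. Tombstones are dropped, so occupied falls
  // back to the live size, but the capacity doubling itself is permanent.
  private def grow(): Unit = {
    val oldKeys = keys
    val oldState = state
    capacity *= 2
    keys = new Array[Int](capacity)
    state = new Array[Byte](capacity)
    _size = 0
    occupied = 0
    var i = 0
    while (i < oldKeys.length) {
      if (oldState(i) == Live) add(oldKeys(i))
      i += 1
    }
  }
}

val set = new ToySet()
var i = 0
while (i < 100000) {  // far fewer iterations than Int.MaxValue
  set.add(1)
  set.remove(1)
  i += 1
}
println((set.size, set.capacity))  // (0, 131072): huge table, no live entries
```
Each add/remove pair consumes one fresh empty slot, so `occupied` crosses the
grow threshold over and over and the capacity doubles each time, even though
`size` never exceeds 1.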
---