GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20561

    [SPARK-23376][SQL] creating UnsafeKVExternalSorter with BytesToBytesMap may 
fail

    ## What changes were proposed in this pull request?
    
    This is a long-standing bug in `UnsafeKVExternalSorter` that has been reported on the dev list multiple times.
    
    When creating an `UnsafeKVExternalSorter` from a `BytesToBytesMap`, we need to create an `UnsafeInMemorySorter` to sort the data in the `BytesToBytesMap`. The sorter and the map use the same data format, so no data movement is required. However, both the sorter and the map need a pointer array for some bookkeeping work.
    
    There is an optimization in `UnsafeKVExternalSorter`: reuse the pointer array between the sorter and the map, to avoid an extra memory allocation. This sounds reasonable: the length of the `BytesToBytesMap` pointer array is at least 4 times the number of keys (to avoid hash collisions, the hash table capacity must be at least 2 times the number of keys, and each key occupies 2 slots). `UnsafeInMemorySorter` needs a pointer array whose size is 4 times the number of entries, so it looks safe to reuse the pointer array.
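
    The size arithmetic above can be sketched as follows. This is a simplified illustration, not the actual Spark code; the class and method names below are hypothetical, and the real bookkeeping uses Spark's `LongArray` rather than plain longs:

    ```java
    // Hypothetical arithmetic behind the reuse optimization (illustrative names,
    // not actual Spark fields).
    public class PointerArrayReuse {
        // BytesToBytesMap: hash table capacity is at least 2x the number of keys,
        // and each key occupies 2 slots in the pointer array.
        static long mapPointerArrayLength(long numKeys) {
            long capacity = 2 * numKeys; // at least 2x keys to limit collisions
            return capacity * 2;         // 2 slots per key
        }

        // UnsafeInMemorySorter: 4 longs per entry are needed in the pointer array.
        static long sorterRequiredLength(long numEntries) {
            return numEntries * 4;
        }

        public static void main(String[] args) {
            long numKeys = 1000;
            // With unique keys (entries == keys) the reuse looks safe:
            assert mapPointerArrayLength(numKeys) >= sorterRequiredLength(numKeys);
            // But with duplicate keys, entries > keys, and the assumption breaks:
            long numEntries = 3 * numKeys; // e.g. each key appears 3 times
            assert mapPointerArrayLength(numKeys) < sorterRequiredLength(numEntries);
        }
    }
    ```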
    
    However, the number of keys in the map is not the same as the number of entries, because `BytesToBytesMap` supports duplicate keys. This breaks the assumption behind the optimization: we may run out of space when inserting data into the sorter and fail with:
    ```
    java.lang.IllegalStateException: There is no space for new record
       at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:239)
       at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:149)
    ...
    ```
    
    This PR fixes the bug by allocating a new pointer array if the existing one is not big enough.
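
    A minimal sketch of that check, using a plain `long[]` in place of Spark's `LongArray` (the method name here is hypothetical; the real change lives in `UnsafeKVExternalSorter`'s constructor):

    ```java
    // Hypothetical sketch of the fix: reuse the map's pointer array only when it
    // is large enough for the sorter, otherwise allocate a fresh one.
    public class PointerArrayFix {
        // UnsafeInMemorySorter needs 4 longs per entry in its pointer array.
        static long[] arrayForSorter(long[] mapArray, int numEntries) {
            int required = numEntries * 4;
            if (mapArray.length >= required) {
                return mapArray;       // big enough: reuse the map's array
            }
            return new long[required]; // too small: allocate a new array
        }

        public static void main(String[] args) {
            long[] mapArray = new long[4000];   // sized for 1000 unique keys
            // 3000 entries (duplicate keys) need 12000 slots, so a new array is used:
            long[] forSorter = arrayForSorter(mapArray, 3000);
            assert forSorter != mapArray && forSorter.length == 12000;
            // 1000 entries fit, so the map's array is reused:
            assert arrayForSorter(mapArray, 1000) == mapArray;
        }
    }
    ```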
    
    ## How was this patch tested?
    
    A new test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20561.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20561
    