GitHub user cloud-fan opened a pull request:
[SPARK-23376][SQL] creating UnsafeKVExternalSorter with BytesToBytesMap may
## What changes were proposed in this pull request?
This is a long-standing bug in `UnsafeKVExternalSorter` and was reported in
the dev list multiple times.
When creating `UnsafeKVExternalSorter` with `BytesToBytesMap`, we need to
create an `UnsafeInMemorySorter` to sort the data in `BytesToBytesMap`. The data
format of the sorter and the map is the same, so no data movement is required.
However, both the sorter and the map need a pointer array for some bookkeeping.
There is an optimization in `UnsafeKVExternalSorter`: reuse the pointer array
between the sorter and the map, to avoid an extra memory allocation. This
sounds like a reasonable optimization: the length of the `BytesToBytesMap`
pointer array is at least 4 times the number of keys (to avoid hash
collisions, the hash table size should be at least 2 times the number of
keys, and each key occupies 2 slots). `UnsafeInMemorySorter` needs the
pointer array size to be 4 times the number of entries, so it seems safe
to reuse the pointer array.
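The sizing arithmetic above can be sketched as follows. This is an illustrative example only; the class and method names are hypothetical and the real Spark internals differ.

```java
// Hypothetical sketch of the sizing assumption described above.
public class PointerArraySizing {
    // BytesToBytesMap: the hash table is at least 2x the number of keys,
    // and each key occupies 2 slots, so the array has >= 4 * numKeys slots.
    static long mapPointerArrayLength(long numKeys) {
        return 4 * numKeys; // lower bound
    }

    // UnsafeInMemorySorter needs 4 slots per record it will sort.
    static long sorterRequiredLength(long numRecords) {
        return 4 * numRecords;
    }

    public static void main(String[] args) {
        long numKeys = 1000;
        // When every key is distinct, records == keys, so reuse is safe:
        assert mapPointerArrayLength(numKeys) >= sorterRequiredLength(numKeys);
    }
}
```

The assumption only holds when the number of records equals the number of keys, which is exactly what breaks below.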
However, the number of keys in the map doesn't equal the number of
entries in the map, because `BytesToBytesMap` supports duplicated keys. This
breaks the assumption of the above optimization, so we may run out of space
when inserting data into the sorter and hit the error
`java.lang.IllegalStateException: There is no space for new record`.
This PR fixes the bug by creating a new pointer array if the existing one is
not big enough.
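A minimal sketch of the fix described above, assuming a helper that chooses between reuse and allocation (the names are illustrative, not the actual Spark code):

```java
// Hedged sketch: reuse the map's pointer array only if it is large enough
// for the sorter; otherwise allocate a new, big-enough array.
public class PointerArrayReuse {
    // UnsafeInMemorySorter needs 4 slots per record.
    static long[] pointerArrayForSorter(long[] mapPointerArray, int numRecords) {
        int required = 4 * numRecords;
        if (mapPointerArray.length >= required) {
            return mapPointerArray; // safe to reuse the map's array
        }
        // With duplicate keys, numRecords can exceed the number of distinct
        // keys the map sized its array for, so allocate a fresh array.
        return new long[required];
    }

    public static void main(String[] args) {
        long[] mapArray = new long[4 * 100]; // sized for 100 distinct keys
        // 150 records (duplicates present): the map's array is too small,
        // so a new array is allocated instead of overflowing the sorter.
        long[] grown = pointerArrayForSorter(mapArray, 150);
        assert grown != mapArray && grown.length == 600;
        // 100 records, all distinct: reuse remains safe.
        assert pointerArrayForSorter(mapArray, 100) == mapArray;
    }
}
```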
## How was this patch tested?
A new test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark bug
Alternatively you can review and apply these changes as the patch at:
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20561
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org