GitHub user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20561#discussion_r167387138
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeKVExternalSorterSuite.scala ---
    @@ -205,4 +206,42 @@ class UnsafeKVExternalSorterSuite extends SparkFunSuite with SharedSQLContext {
           spill = true
         )
       }
    +
    +  test("SPARK-23376: Create UnsafeKVExternalSorter with BytesToByteMap 
having duplicated keys") {
    +    val memoryManager = new TestMemoryManager(new SparkConf())
    +    val taskMemoryManager = new TaskMemoryManager(memoryManager, 0)
    +    val map = new BytesToBytesMap(taskMemoryManager, 64, taskMemoryManager.pageSizeBytes())
    +
    +    // Key/value are unsafe rows with a single int column
    +    val schema = new StructType().add("i", IntegerType)
    +    val key = new UnsafeRow(1)
    +    key.pointTo(new Array[Byte](32), 32)
    +    key.setInt(0, 1)
    +    val value = new UnsafeRow(1)
    +    value.pointTo(new Array[Byte](32), 32)
    +    value.setInt(0, 2)
    +
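    +    // Append the same key 65 times. Duplicate keys don't trigger the map to
    +    // grow (growth tracks distinct keys), so the number of values ends up
    +    // exceeding the map's initial capacity of 64.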
    +    for (_ <- 1 to 65) {
    +      val loc = map.lookup(key.getBaseObject, key.getBaseOffset, key.getSizeInBytes)
    +      loc.append(
    +        key.getBaseObject, key.getBaseOffset, key.getSizeInBytes,
    +        value.getBaseObject, value.getBaseOffset, value.getSizeInBytes)
    +    }
    +
    +    // Make sure we can successfully create an UnsafeKVExternalSorter with a
    +    // `BytesToBytesMap` that has duplicated keys and more entries than its capacity.
    --- End diff ---
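    
    The diff is truncated at the review marker; presumably the test then finishes
    by handing the over-full map to the sorter, roughly along these lines. The
    constructor argument list below is an assumption based on
    `UnsafeKVExternalSorter`'s usual signature, not a quote from the patch:
    
    ```scala
    // Sketch of the likely continuation: build the sorter directly from the map
    // that now holds 65 entries (all duplicates of one key) in a table created
    // with capacity 64, then release its resources.
    val sorter = new UnsafeKVExternalSorter(
      schema, schema, sparkContext.env.blockManager, sparkContext.env.serializerManager,
      taskMemoryManager.pageSizeBytes(), Int.MaxValue, map)
    sorter.cleanupResources()
    ```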
    
    Yes, we use `BytesToBytesMap` to build the broadcast join hash relation, which
    may have duplicated keys. I only create a new pointer array if the existing
    one is not big enough, so we won't see a performance regression for
    aggregation.
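    
    For readers following along, a minimal sketch of the reuse-vs-allocate
    decision described above (names such as `allocateNewArray` are illustrative
    placeholders, not the exact code in `UnsafeKVExternalSorter`):
    
    ```scala
    // Each in-memory-sorter entry needs two longs (record pointer + key prefix),
    // so the pointer array must hold 2x the number of records. With duplicated
    // keys, numValues() can exceed what the map's own array was sized for.
    val numRecords = map.numValues()   // counts every appended value, duplicates included
    val mapArray   = map.getArray()    // the map's existing pointer array
    
    val pointerArray =
      if (numRecords * 2L <= mapArray.size()) {
        mapArray                          // big enough: reuse it, no new allocation
      } else {
        allocateNewArray(numRecords * 2L) // hypothetical helper: grow only when needed
      }
    ```
    
    Reusing the existing array in the common no-duplicates path is what keeps the
    aggregate case free of extra allocations.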

