[GitHub] [spark] wankunde commented on pull request #37759: [SPARK-40306][SQL]Support more than Integer.MAX_VALUE of the same join key

GitBox Tue, 06 Sep 2022 20:02:26 -0700


wankunde commented on PR #37759:
URL: https://github.com/apache/spark/pull/37759#issuecomment-1238852769


   
   > Can we add a test? or at least can you describe how you tested it?
   
   A simple unit test:
   
   ```
     test("join with too many duplicate key") {
       withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
       // Use a Long literal: Integer.MAX_VALUE + 2 would overflow as an Int.
       val duplicateKeyNumber = Integer.MAX_VALUE + 2L
         val df1 = Seq(1).toDF("id")
         val df2 =
           spark.range(1, duplicateKeyNumber, 1, 1000).map(_ => 1).toDF("id")
         df1.join(df2, Seq("id"), "inner").collect()
       }
     }
   ```
   For this SMJ (sort-merge join), the `SortMergeJoinScanner` appends all
   `id=1` rows into an `ExternalAppendOnlyUnsafeRowArray`. Once
   `Integer.MAX_VALUE` rows have been appended,
   `ExternalAppendOnlyUnsafeRowArray.numRows` overflows on the next append.
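
To illustrate the kind of overflow described above, here is a minimal standalone sketch. It is not the actual `ExternalAppendOnlyUnsafeRowArray` code: `NumRowsOverflowSketch`, `IntCounter` and `LongCounter` are hypothetical names, and the sketch only assumes that the row count is held in a 32-bit `Int`.

```scala
// Hypothetical stand-ins for a row counter; these are not Spark classes.
object NumRowsOverflowSketch {
  // Counter backed by an Int, i.e. a 32-bit signed row count.
  final class IntCounter(start: Int) {
    private var numRows: Int = start
    // Increment and return the new count; wraps silently on overflow.
    def add(): Int = { numRows += 1; numRows }
  }

  // Counter backed by a Long, which has room beyond Int.MaxValue.
  final class LongCounter(start: Long) {
    private var numRows: Long = start
    def add(): Long = { numRows += 1; numRows }
  }

  def main(args: Array[String]): Unit = {
    // Start both counters at Int.MaxValue, as if that many rows were appended.
    val intCounter = new IntCounter(Int.MaxValue)
    val longCounter = new LongCounter(Int.MaxValue.toLong)
    println(intCounter.add())   // wraps around to Int.MinValue
    println(longCounter.add())  // Int.MaxValue + 1, held as a Long
  }
}
```

The same wrap-around applies to any `Int` expression that exceeds `Int.MaxValue`, so a bound passed to `spark.range` for this case needs to be computed as a `Long`.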


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

