MaxGekk opened a new pull request #28328:
URL: https://github.com/apache/spark/pull/28328


   ### What changes were proposed in this pull request?
   The `InSet` expression expects its input collection to contain internal Catalyst values; for example, `hset` must hold `UTF8String` elements when `child` is of string type. This means `isInCollection` must convert user-provided values to internal Catalyst values, but currently it does not perform the conversion. That leads to incorrect results when the collection size exceeds the threshold `spark.sql.optimizer.inSetConversionThreshold`.
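
   The mismatch can be illustrated outside Spark with a minimal, self-contained Scala sketch. The `Utf8` wrapper below is a hypothetical stand-in for Spark's internal `UTF8String`, not the real class: it shows why a hash-set membership probe misses when the stored values and the probed value use different representations.

   ```scala
   // Hypothetical stand-in for an internal string type: it never equals a
   // plain java.lang.String, mirroring how UTF8String differs from String.
   final case class Utf8(bytes: Array[Byte]) {
     override def equals(other: Any): Boolean = other match {
       case Utf8(b) => java.util.Arrays.equals(bytes, b)
       case _       => false // an external String never matches
     }
     override def hashCode: Int = java.util.Arrays.hashCode(bytes)
   }
   object Utf8 { def fromString(s: String): Utf8 = Utf8(s.getBytes("UTF-8")) }

   // The user supplies external String values, but an InSet-style lookup
   // probes with an internal value produced by evaluating `child`.
   val userSet: Set[Any]    = (0 to 20).map(_.toString).toSet
   val childValue           = Utf8.fromString("1")
   val missWithoutConvert   = userSet.contains(childValue)   // false: representations differ
   val converted: Set[Any]  = userSet.map(v => Utf8.fromString(v.toString))
   val hitAfterConvert      = converted.contains(childValue) // true: set was converted first

   println(s"without conversion: $missWithoutConvert, after conversion: $hitAfterConvert")
   ```

   Converting every element of the set once, at construction time, is what makes the subsequent per-row lookups correct.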
   
   ### Why are the changes needed?
   The changes fix the incorrect behaviour of `isInCollection`. For example, with the SQL config `spark.sql.optimizer.inSetConversionThreshold` set to 10 (the default):
   ```scala
   val set = (0 to 20).map(_.toString).toSet
   val data = Seq("1").toDF("x")
   data.select($"x".isInCollection(set).as("isInCollection")).show()
   ```
   The query must return **true** because "1" is in the set of "0" ... "20", but it returns **false**:
   ```
   +--------------+
   |isInCollection|
   +--------------+
   |         false|
   +--------------+
   ```
   
   ### Does this PR introduce any user-facing change?
   Yes
   
   ### How was this patch tested?
   - By the existing test suite `ColumnExpressionSuite`
   - Added a new test to `ColumnExpressionSuite`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
