MaxGekk opened a new pull request #28328:
URL: https://github.com/apache/spark/pull/28328
### What changes were proposed in this pull request?
The `InSet` expression expects input collections of internal Catalyst types;
for example, `hset` must contain `UTF8String` elements when `child` is of string
type. This means `isInCollection` must convert user-provided values to internal
Catalyst values, but currently it does not perform that conversion. This leads to
incorrect results for collection sizes above the threshold
`spark.sql.optimizer.inSetConversionThreshold`.
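To illustrate the type mismatch, here is a minimal, self-contained sketch. This is not Spark's actual implementation: `InternalString` and `toInternal` are made-up stand-ins for `UTF8String` and the external-to-Catalyst conversion, used only to show why a set of external values can never match an internal child value.

```scala
// Toy stand-in for Spark's internal UTF8String: equality is byte-based,
// and an InternalString never equals a plain java.lang.String.
final case class InternalString(bytes: Seq[Byte])

object InSetDemo {
  // Stand-in for the external-to-Catalyst conversion the fix performs.
  def toInternal(s: String): InternalString =
    InternalString(s.getBytes("UTF-8").toSeq)

  def main(args: Array[String]): Unit = {
    // At evaluation time the child value is already in internal form.
    val child: Any = toInternal("1")

    // Buggy InSet: hset holds external String values, so lookup always fails.
    val buggyHset: Set[Any] = (0 to 20).map(_.toString).toSet[Any]
    println(buggyHset.contains(child))  // false: InternalString never equals String

    // Fixed InSet: user values are converted to the internal form first.
    val fixedHset: Set[Any] = (0 to 20).map(i => toInternal(i.toString)).toSet[Any]
    println(fixedHset.contains(child))  // true
  }
}
```

In the toy model, as in Spark, the membership test compares values by equality, so the set and the probe value must use the same representation.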
### Why are the changes needed?
The changes fix the incorrect behaviour of `isInCollection`. For example, with
the SQL config `spark.sql.optimizer.inSetConversionThreshold` set to 10 (the
default):
```scala
import spark.implicits._

val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The query must return **true** because "1" is in the set of "0" ... "20", but
it returns **false**:
```
+--------------+
|isInCollection|
+--------------+
| false|
+--------------+
```
### Does this PR introduce any user-facing change?
Yes, `isInCollection` now returns correct results for collections whose size
exceeds `spark.sql.optimizer.inSetConversionThreshold`.
### How was this patch tested?
- By the existing test suite `ColumnExpressionSuite`
- Added a new test to `ColumnExpressionSuite`