GitHub user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19635
@gatorsmile I see your point, and of course there is a behavioral change.
For instance, `select 1 in ('01')` returned `false` before this PR and returns
`true` after it.
Nonetheless, I think the current behavior is not good at all. The problem is
not only that `IN` works differently from `=` (which IS a problem, since there
are places in the code like
https://github.com/apache/spark/blob/572284c5b08515901267a37adef4f8e55df3780e/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L150
where `IN` is translated to a sequence of equality comparisons). The real
issue is that `IN` behaves differently depending on whether it is used with a
subquery or with a list of literals. For an example, please refer to the test
I added in this PR.
This is very bad, since people may use hardcoded literals for testing and a
subquery in their real workload, and the behavior can change between these two
scenarios. Currently, a query that works with literals sometimes throws an
exception with a subquery, or the two forms simply return different results.
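To make the `IN` versus `=` mismatch concrete, here is a toy model in plain
Python. It is only a sketch and not Spark's coercion code: the three helper
functions are hypothetical, and the assumption that one path compares values
as strings while the other compares them as numbers is mine, chosen only to
reproduce the `false` / `true` results mentioned above.

```python
# Toy model only: these helpers are hypothetical and do not reproduce
# Spark's actual type-coercion rules.

def eq_numeric(left, right):
    """Model `=` by casting both sides to int before comparing (assumption)."""
    return int(left) == int(right)


def in_as_strings(value, candidates):
    """Model a literal-list `IN` that compares every value as a string (assumption)."""
    return any(str(value) == str(c) for c in candidates)


def in_as_equalities(value, candidates):
    """Model `IN` rewritten as a sequence of `=` comparisons."""
    return any(eq_numeric(value, c) for c in candidates)


# Mirrors `select 1 in ('01')` under the two models.
old_result = in_as_strings(1, ["01"])    # "1" != "01"
rewritten = in_as_equalities(1, ["01"])  # 1 == 1
print(old_result, rewritten)             # prints: False True
```

If `IN` is rewritten into equality checks anywhere in the engine, the two
models have to agree, otherwise the same predicate can give different answers
depending on which code path evaluates it.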
Thus, although introducing a behavior change is generally something we would
like to avoid, I do believe the current situation is too bad to leave as it
is. I think this change keeps the behavioral changes to a minimum while making
the behavior coherent, but of course I am open to any kind of discussion about
this.