Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19635
  
    @gatorsmile I see you point. And of course there is a behavioral change. 
For instance, `select 1 in ('01')` before the PR return `false`, after it 
returns `true`.
    
    Nonetheless, I think that the current behavior is not good at all. Indeed, 
the problem is not only that `IN` works differently from `=` (which actually IS 
a problem, since there are places in the code like 
https://github.com/apache/spark/blob/572284c5b08515901267a37adef4f8e55df3780e/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L150
 where `IN` is translated to a sequence of equality comparisons). But the real 
issue is that IN behaves differently whether it is used with a subquery or with 
a list of literals. For instance, please refer to the test I added in this PR. 
This is very bad, since maybe people are using hardcoded literals for testing 
and a subquery in their real workload and the behavior might change between 
these two scenarios. Sometimes, currently, what is working with literals is 
even throwing an exception with subqueries. Or they are simply returning 
different results.
    
    Thus, I do believe that despite introducing a behavior change is generally 
something we would like to avoid, here the current situation is too bad to let 
it as it is. And I think this is the change which minimizes the behavioral 
changes making them coherent, but of course I am open to any kind of discussion 
about this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to