Github user jliwork commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19776#discussion_r152136470
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
    @@ -497,7 +497,19 @@ object DataSourceStrategy {
             Some(sources.IsNotNull(a.name))
     
           case expressions.And(left, right) =>
    -        (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
    +        // See SPARK-12218 for detailed discussion
    +        // It is not safe to just convert one side if we do not understand the
    +        // other side. Here is an example used to explain the reason.
    +        // Let's say we have (a = 2 AND trim(b) = 'blah') OR (c > 0)
    +        // and we do not understand how to convert trim(b) = 'blah'.
    +        // If we only convert a = 2, we will end up with
    +        // (a = 2) OR (c > 0), which will generate wrong results.
    +        // Pushing one leg of AND down is only safe to do at the top level.
    +        // You can see ParquetFilters' createFilter for more details.
    +        for {
    +          leftFilter <- translateFilter(left)
    +          rightFilter <- translateFilter(right)
    +        } yield sources.And(leftFilter, rightFilter)
    --- End diff --
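To make the comment's example concrete, here is a toy Scala sketch (plain collections, not Spark's actual classes) showing how dropping the untranslatable `trim(b) = 'blah'` leg from the AND changes which rows pass the predicate:

```scala
// Toy rows standing in for a table scanned by a data source.
case class Row(a: Int, b: String, c: Int)

val rows = Seq(Row(2, "nope", 0), Row(2, "blah", 0), Row(1, "x", 5))

// Original predicate: (a = 2 AND trim(b) = 'blah') OR (c > 0)
val original = rows.filter(r => (r.a == 2 && r.b.trim == "blah") || r.c > 0)

// Mis-translated predicate after silently dropping one AND leg:
// (a = 2) OR (c > 0)
val broken = rows.filter(r => r.a == 2 || r.c > 0)

// Row(2, "nope", 0) wrongly passes the broken filter, so pushing it
// down to a source that trusts the filter would return extra rows.
```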
    
    I would think so. SPARK-12218 put fixes into ParquetFilters.createFilter and OrcFilters.createFilter. They're similar to DataSourceStrategy.translateFilter but have different signatures, customized for Parquet and ORC. For all data sources, including JDBC, Parquet, etc., translateFilter is called to determine whether a predicate Expression can be pushed down as a Filter. Then, for Parquet and ORC, the Filters get mapped to Parquet- or ORC-specific filters by their own createFilter methods.
    
    So this PR does help all data sources get the correct set of push-down predicates. Without this PR we simply got lucky with Parquet and ORC in terms of result correctness, because 1) it looks like we always apply a Filter on top of the scan, and 2) with one leg missing from an AND, we end up with the same number of rows or more.
    
    The JDBC data source does not always come with a Filter on top of the scan, which is why it exposed the bug.
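A simplified model of the fix itself (toy Expr/Filter types, not Spark's real sources API): the safe version requires both sides of an AND to translate, while the pre-fix version keeps whichever leg it understands.

```scala
// Hypothetical, simplified stand-ins for Catalyst expressions and
// data source filters.
sealed trait Expr
case class Eq(col: String, v: Int) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Unsupported(desc: String) extends Expr

sealed trait Filter
case class EqF(col: String, v: Int) extends Filter
case class AndF(l: Filter, r: Filter) extends Filter

// Safe translation (this PR): if either side fails, drop the whole AND.
def translate(e: Expr): Option[Filter] = e match {
  case Eq(c, v) => Some(EqF(c, v))
  case And(l, r) =>
    for {
      lf <- translate(l)
      rf <- translate(r)
    } yield AndF(lf, rf)
  case Unsupported(_) => None
}

// Pre-fix behavior: reduceOption keeps a lone leg when the other side
// does not translate, which is only safe at the top level of the tree.
def translateBuggy(e: Expr): Option[Filter] = e match {
  case Eq(c, v) => Some(EqF(c, v))
  case And(l, r) =>
    (translateBuggy(l).toSeq ++ translateBuggy(r)).reduceOption(AndF)
  case Unsupported(_) => None
}
```

With `And(Eq("a", 2), Unsupported("trim(b) = 'blah'"))`, the safe version returns `None` (nothing is pushed down and Spark evaluates the predicate itself), while the buggy version returns just `Some(EqF("a", 2))`, the partial filter that produces wrong results once nested under an OR.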

