GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/16021

    JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL

    ## What changes were proposed in this pull request?
    
    In Apache Spark 1.6.x, JDBCRDD returns incorrect results for filters on PostgreSQL `CHAR` columns.
    The root cause is that PostgreSQL returns space-padded strings for `CHAR(n)` values, so the
    Spark-side post-processing filter `Filter (a#0 = A)` evaluates to false and drops the matching rows.
    Spark 2.0.0 removes that post filter because the predicate is already handled inside the database
    via `PushedFilters: [EqualTo(a,A)]`. This PR backports the relevant commits to fix the problem in Spark 1.6.x.
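
    To illustrate the root cause, here is a minimal sketch (not part of this patch) of the string semantics involved: PostgreSQL pads `CHAR(n)` values with trailing blanks and ignores that padding when it evaluates the pushed-down predicate, while the Spark-side post filter compares the padded JVM string it received over JDBC exactly.
    ```scala
    // What a CHAR(10) column returns over JDBC: the value padded to 10 characters.
    val fromCharColumn = "A" + " " * 9
    val literal = "A"

    // The Spark-side post filter `Filter (a#0 = A)` compares the strings exactly and drops the row.
    assert(!(fromCharColumn == literal))

    // The pushed-down predicate `a = 'A'` is evaluated by PostgreSQL with blank-padded CHAR
    // semantics, which behaves roughly like comparing the trimmed value.
    assert(fromCharColumn.trim == literal)
    ```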
    
    **PostgreSQL Table**
    ```
    postgres=# \d
               List of relations
     Schema |   Name    | Type  |  Owner   
    --------+-----------+-------+----------
     public | t_char    | table | postgres
     public | t_varchar | table | postgres
    (2 rows)
    ```
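
    The column definitions are not shown above; the reproduction assumes `a` is declared as `CHAR(10)` in `t_char` and `VARCHAR(10)` in `t_varchar` (an assumption based on the padded output below, not stated in the PR). A plain-JDBC sketch of that assumed setup:
    ```scala
    import java.sql.DriverManager

    // Assumed schema for the reproduction: CHAR(10) vs VARCHAR(10).
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/postgres", "postgres", "rootpass")
    try {
      val stmt = conn.createStatement()
      stmt.executeUpdate("CREATE TABLE t_char (a CHAR(10))")
      stmt.executeUpdate("CREATE TABLE t_varchar (a VARCHAR(10))")
      stmt.executeUpdate("INSERT INTO t_char VALUES ('A'), ('AA'), ('AAA')")
      stmt.executeUpdate("INSERT INTO t_varchar VALUES ('A'), ('AA'), ('AAA')")
    } finally {
      conn.close()
    }
    ```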
    
    **Spark 1.6.3 Result**
    ```scala
    scala> val t_char = sqlContext.read.option("user", "postgres").option("password", "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_char", new java.util.Properties())
    t_char: org.apache.spark.sql.DataFrame = [a: string]
    
    scala> val t_varchar = sqlContext.read.option("user", "postgres").option("password", "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_varchar", new java.util.Properties())
    t_varchar: org.apache.spark.sql.DataFrame = [a: string]
    
    scala> t_char.show
    +----------+
    |         a|
    +----------+
    |A         |
    |AA        |
    |AAA       |
    +----------+
    
    scala> t_varchar.show
    +---+
    |  a|
    +---+
    |  A|
    | AA|
    |AAA|
    +---+
    
    scala> t_char.filter(t_char("a")==="A").show  // This is the wrong result
    +---+
    |  a|
    +---+
    +---+
    
    scala> t_char.filter(t_char("a")==="A         ").show
    +----------+
    |         a|
    +----------+
    |A         |
    +----------+
    
    scala> t_varchar.filter(t_varchar("a")==="A").show
    +---+
    |  a|
    +---+
    |  A|
    +---+
    
    scala> t_char.filter(t_char("a")==="A").explain
    == Physical Plan ==
    Filter (a#0 = A)
    +- Scan JDBCRelation(jdbc:postgresql://localhost:5432/postgres,t_char,[Lorg.apache.spark.Partition;@2f65c341,{user=postgres, password=rootpass})[a#0] PushedFilters: [EqualTo(a,A)]
    ```
    
    ## How was this patch tested?
    
    N/A (this is a backport of the commits listed below).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-18593

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16021.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16021
    
----
commit d712319be2f6fd212e9bbbae51f8dca747b50ae4
Author: Takeshi YAMAMURO <[email protected]>
Date:   2015-12-22T08:50:05Z

    [SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions
    
    There were previously no tests for JDBCRDD#compileFilter.
    
    Author: Takeshi YAMAMURO <[email protected]>
    
    Closes #10409 from maropu/AddTestsInJdbcRdd.
    
    (cherry picked from commit 8c1b867cee816d0943184c7b485cd11e255d8130)
    Signed-off-by: Dongjoon Hyun <[email protected]>
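
As a rough illustration of what such a unit test exercises (a minimal sketch; `compileEqualTo` below is a hypothetical stand-in, not the actual JDBCRDD helper):

```scala
// Hypothetical stand-in for JDBCRDD's internal filter-to-SQL translation.
def compileEqualTo(attr: String, value: Any): String = value match {
  case s: String => s"$attr = '$s'"
  case v         => s"$attr = $v"
}

// Checks of the generated SQL fragments, in the spirit of the tests added by this commit.
assert(compileEqualTo("col0", "xxx") == "col0 = 'xxx'")
assert(compileEqualTo("col1", 1) == "col1 = 1")
```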

commit abaae83cd26e419828067818f36d0cfd7862b3fd
Author: Takeshi YAMAMURO <[email protected]>
Date:   2015-12-30T21:34:37Z

    [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC
    
    This is a rework of #10386 that adds more tests and LIKE push-down support.
    
    Author: Takeshi YAMAMURO <[email protected]>
    
    Closes #10468 from maropu/SupportMorePushdownInJdbc.
    
    (cherry picked from commit 5c2682b0c8fd2aeae2af1adb716ee0d5f8b85135)
    Signed-off-by: Dongjoon Hyun <[email protected]>
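
The push-down idea, sketched with a simplified filter ADT (illustrative names; Spark's real filter classes live in `org.apache.spark.sql.sources` and the translation is internal to the JDBC datasource): each supported filter is compiled into a SQL WHERE fragment, with AND/OR handled recursively and IN/LIKE mapped to the corresponding SQL operators.

```scala
// Simplified filter ADT and translator sketching the push-down idea (not Spark's actual code).
sealed trait Filter
case class EqualTo(attr: String, value: String)  extends Filter
case class In(attr: String, values: Seq[String]) extends Filter
case class Like(attr: String, pattern: String)   extends Filter
case class And(left: Filter, right: Filter)      extends Filter
case class Or(left: Filter, right: Filter)       extends Filter

// Translate a filter into a SQL WHERE fragment; None means "cannot be pushed down".
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(a, v) => Some(s"$a = '$v'")
  case In(a, vs)     => Some(vs.map(v => s"'$v'").mkString(s"$a IN (", ", ", ")"))
  case Like(a, p)    => Some(s"$a LIKE '$p'")
  case And(l, r)     => for (ls <- compileFilter(l); rs <- compileFilter(r)) yield s"($ls) AND ($rs)"
  case Or(l, r)      => for (ls <- compileFilter(l); rs <- compileFilter(r)) yield s"($ls) OR ($rs)"
}

// compileFilter(Or(EqualTo("a", "A"), Like("a", "A%")))
//   == Some("(a = 'A') OR (a LIKE 'A%')")
```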

commit bb9aa2cf779bcce4555f4c6bcb7f924b5bb26989
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-01-01T08:54:25Z

    [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add a few filters
    
    This patch refactors the filter pushdown for JDBCRDD and also adds a few filters.
    
    The added filters are basically from #10468, with some refactoring. The test cases are also from #10468.
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #10470 from viirya/refactor-jdbc-filter.
    
    (cherry picked from commit ad5b7cfcca7a5feb83b9ed94b6e725c6d789579b)
    Signed-off-by: Dongjoon Hyun <[email protected]>

commit c1b9d83af5fe383f83bfe7da08a8be2f045e34fe
Author: hyukjinkwon <[email protected]>
Date:   2016-01-02T08:04:48Z

    [SPARK-10180][SQL] JDBC datasource is not processing EqualNullSafe filter
    
    This PR follows https://github.com/apache/spark/pull/8391.
    The previous PR fixed JDBCRDD to support null-safe equality comparison for the JDBC datasource. This PR fixes the problem that the comparison can actually return null, which results in an error when the value of that comparison is used.
    
    Author: hyukjinkwon <[email protected]>
    Author: HyukjinKwon <[email protected]>
    
    Closes #8743 from HyukjinKwon/SPARK-10180.
    
    (cherry picked from commit 94f7a12b3c8e4a6ecd969893e562feb7ffba4c24)
    Signed-off-by: Dongjoon Hyun <[email protected]>
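
For context on the null handling this commit fixes: under SQL's three-valued logic a plain `a = b` evaluates to NULL when either side is NULL, so a null-safe equality has to be expanded so that it always yields TRUE or FALSE. A hedged sketch, extending the translator above (the exact fragment Spark generates may differ):

```scala
// Null-safe equality expansion that never evaluates to NULL:
//   - TRUE  when both sides are NULL, or both are equal non-NULL values,
//   - FALSE otherwise (including when exactly one side is NULL).
def compileEqualNullSafe(attr: String, value: Any): String = {
  val lit = if (value == null) "NULL" else s"'$value'"
  s"(NOT ($attr != $lit OR $attr IS NULL OR $lit IS NULL) OR ($attr IS NULL AND $lit IS NULL))"
}
```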

commit 59b7e4c20c5da81b26039e7e049e3ab656bf1fbe
Author: Takeshi YAMAMURO <[email protected]>
Date:   2016-02-10T01:45:13Z

    [SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
    
    Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
    
    Current plan:
    ```
    == Optimized Logical Plan ==
    Project [col0#0,col1#1]
    +- Filter (col0#0 = xxx)
       +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
    
    == Physical Plan ==
    +- Filter (col0#0 = xxx)
       +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
    ```
    
    This patch enables the plan below:
    ```
    == Optimized Logical Plan ==
    Project [col0#0,col1#1]
    +- Filter (col0#0 = xxx)
       +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
    
    == Physical Plan ==
    Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
    ```
    
    Author: Takeshi YAMAMURO <[email protected]>
    
    Closes #10427 from maropu/RemoveFilterInJdbcScan.
    
    (cherry picked from commit 6f710f9fd4f85370557b7705020ff16f2385e645)
    Signed-off-by: Dongjoon Hyun <[email protected]>
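
The mechanism behind this commit, sketched with the simplified types from the earlier snippets (the real relation extends `org.apache.spark.sql.sources.BaseRelation` and the planner consumes its answer): the relation reports back which filters it could not push down, and only those keep a Spark-side `Filter` node.

```scala
// Sketch only, reusing the simplified Filter ADT and compileFilter from the earlier snippet:
// a relation reports the filters it cannot handle, so the planner can drop the redundant
// Spark-side Filter for every predicate that was fully pushed down to the database.
trait SimpleRelation {
  def unhandledFilters(filters: Array[Filter]): Array[Filter]
}

object SimpleJdbcRelation extends SimpleRelation {
  // A filter is handled if it compiles to a SQL fragment the database can evaluate.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filter(f => compileFilter(f).isEmpty)
}

// With EqualTo("col0", "xxx") compiling successfully, unhandledFilters returns an empty
// array, and the physical plan keeps only the Scan with PushedFilters, as in the second
// plan above.
```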

----

