[GitHub] spark pull request: [SPARK-11153] [SQL] Disables Parquet filter pu...

liancheng Fri, 16 Oct 2015 13:13:32 -0700

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/9152


    [SPARK-11153] [SQL] Disables Parquet filter push-down for string and binary columns

    Due to PARQUET-251, `BINARY` columns in existing Parquet files may have 
been written with corrupted statistics. These statistics are used by the 
filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
push-down by default, we may end up with wrong query results. PARQUET-251 has 
been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
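
To illustrate why corrupted statistics can produce wrong results, here is a minimal Python sketch of min/max-based row-group skipping. It is not Spark or parquet-mr code: `RowGroup`, `scan_eq`, and the sample data are hypothetical, and the wrong `stat_max` value only stands in for the kind of corruption PARQUET-251 describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group of one BINARY column."""
    values: List[bytes]  # the values actually stored
    stat_min: bytes      # minimum recorded in the statistics
    stat_max: bytes      # maximum recorded in the statistics


def scan_eq(row_groups: List[RowGroup], target: bytes, push_down: bool) -> List[bytes]:
    """Return all values equal to `target`.

    With push_down=True, a row group is skipped whenever its recorded
    min/max range excludes `target`, without reading its values.
    """
    hits: List[bytes] = []
    for rg in row_groups:
        if push_down and not (rg.stat_min <= target <= rg.stat_max):
            continue  # skipped purely on the basis of the statistics
        hits.extend(v for v in rg.values if v == target)
    return hits


# One row group with correct statistics, one whose recorded max is wrong
# (assumed corruption: the true max is b"zebra", not b"banana").
good = RowGroup([b"apple", b"kiwi"], stat_min=b"apple", stat_max=b"kiwi")
bad = RowGroup([b"apple", b"zebra"], stat_min=b"apple", stat_max=b"banana")
groups = [good, bad]

with_push_down = scan_eq(groups, b"zebra", push_down=True)
without_push_down = scan_eq(groups, b"zebra", push_down=False)
print(with_push_down, without_push_down)
```

In this sketch the scan with push-down enabled misses the matching row, because the row group that holds it is skipped based on its wrong recorded maximum. The scan without push-down reads every value and finds it.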
    
    This affects all Spark SQL data types that can be mapped to Parquet 
`BINARY`, namely:
    
    - `StringType`
    
    - `BinaryType`
    
    - `DecimalType`
    
      (But Spark SQL doesn't support pushing down filters involving 
`DecimalType` columns for now.)
    
    To avoid wrong query results, we should disable filter push-down for 
columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8.1.
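
The workaround amounts to gating push-down on the column's data type. The Python sketch below shows one way such a gate could look. It is not the code in this pull request: `BINARY_BACKED_TYPES`, `can_push_down`, and `split_filters` are hypothetical names, and data types are represented as plain strings for simplicity.

```python
from typing import List, Tuple

# Spark SQL types that map to Parquet BINARY, per the description above.
# DecimalType is included for completeness; the description notes that
# filters on it are not pushed down anyway.
BINARY_BACKED_TYPES = {"StringType", "BinaryType", "DecimalType"}

# A filter is modelled as (column name, column type, predicate text).
Filter = Tuple[str, str, str]


def can_push_down(column_type: str) -> bool:
    """Allow push-down only for types whose statistics are not affected."""
    return column_type not in BINARY_BACKED_TYPES


def split_filters(filters: List[Filter]) -> Tuple[List[Filter], List[Filter]]:
    """Split filters into those handed to the Parquet reader and those
    that must be evaluated only after the rows have been read."""
    pushed = [f for f in filters if can_push_down(f[1])]
    residual = [f for f in filters if not can_push_down(f[1])]
    return pushed, residual


filters = [
    ("id", "IntegerType", "id > 10"),
    ("name", "StringType", "name = 'zebra'"),
    ("payload", "BinaryType", "payload IS NOT NULL"),
]
pushed, residual = split_filters(filters)
print([f[0] for f in pushed], [f[0] for f in residual])
```

Filters on string and binary columns are still applied in this sketch, just not inside the Parquet reader, so correctness no longer depends on the stored statistics.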


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-11153.workaround-parquet-251

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9152
    
----
commit 3584d032e1ca4b10015a21c2b6b3f153014e3dfb
Author: Cheng Lian <[email protected]>
Date:   2015-10-16T20:09:19Z

    Disables Parquet filter push-down for string and binary columns

----



