GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/9152
[SPARK-11153] [SQL] Disables Parquet filter push-down for string and binary columns
Due to PARQUET-251, `BINARY` columns in existing Parquet files may be
written with corrupted statistics information. This information is used by
filter push-down optimization. Since Spark 1.5 turns on Parquet filter
push-down by default, we may end up with wrong query results. PARQUET-251 has
been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
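To illustrate why bad statistics lead to wrong results rather than just slower scans, here is a minimal sketch in plain Python. It does not use Parquet or Spark; `RowGroup`, `scan_eq`, and the sample values are hypothetical stand-ins, and the "corrupted" min/max is simply assumed rather than reproduced from PARQUET-251.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with column statistics."""
    values: List[str]
    stat_min: str
    stat_max: str


def scan_eq(groups: List[RowGroup], target: str, use_stats: bool) -> List[str]:
    """Return values equal to `target`, optionally skipping groups by min/max."""
    hits = []
    for g in groups:
        if use_stats and not (g.stat_min <= target <= g.stat_max):
            # The statistics claim the value cannot be in this group: skip it.
            continue
        hits.extend(v for v in g.values if v == target)
    return hits


# Statistics that correctly cover the stored values.
good = RowGroup(["apple", "banana"], stat_min="apple", stat_max="banana")
# Assumed corruption: the recorded max does not cover "mango".
bad = RowGroup(["cherry", "mango"], stat_min="cherry", stat_max="cherry")

groups = [good, bad]
with_pushdown = scan_eq(groups, "mango", use_stats=True)
without_pushdown = scan_eq(groups, "mango", use_stats=False)
```

With statistics-based skipping the matching row is silently dropped, while a full scan finds it.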
This affects all Spark SQL data types that can be mapped to Parquet
`BINARY`, namely:
- `StringType`
- `BinaryType`
- `DecimalType`
(But Spark SQL doesn't support pushing down filters involving
`DecimalType` columns for now.)
To avoid wrong query results, we should disable filter push-down for
columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8.1.
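The idea can be sketched as a type check applied before any filter is handed to the Parquet reader. This is a plain Python illustration, not the code in this pull request; `UNSAFE_FOR_PUSHDOWN`, `pushable_filters`, and the tuple-based filter representation are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical type names mirroring the Spark SQL types listed above.
UNSAFE_FOR_PUSHDOWN = {"StringType", "BinaryType"}

Filter = Tuple[str, str, object]  # (column, operator, value)


def pushable_filters(schema: Dict[str, str], filters: List[Filter]) -> List[Filter]:
    """Keep only filters whose column type is considered safe to push down."""
    return [f for f in filters if schema[f[0]] not in UNSAFE_FOR_PUSHDOWN]


schema = {"id": "IntegerType", "name": "StringType", "payload": "BinaryType"}
filters = [("id", ">", 10), ("name", "=", "spark")]

# Only the integer filter survives; the string filter is left for the engine
# to evaluate after the rows have been read.
pushed = pushable_filters(schema, filters)
```

A filter that is not pushed down is still evaluated after the rows are read, so this trades some scan performance for correct results.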
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark spark-11153.workaround-parquet-251
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9152.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9152
----
commit 3584d032e1ca4b10015a21c2b6b3f153014e3dfb
Author: Cheng Lian
Date: 2015-10-16T20:09:19Z
Disables Parquet filter push-down for string and binary columns
----