GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/8107

    [SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to 
be pushed down

    This PR adds a hacky workaround for PARQUET-201, which should be removed 
once we upgrade to parquet-mr 1.8.1 or later.
    
    In Parquet, not all column types can be used for filter push-down 
optimization.  The set of valid column types is controlled by `ValidTypeMap`.  
Unfortunately, in parquet-mr 1.7.0 and prior versions, this check is too 
strict and doesn't allow predicates on `BINARY (ENUM)` columns to be pushed 
down, even though `BINARY (ENUM)` is commonly seen in Parquet files written by 
libraries like `parquet-avro`.
    
    This restriction is problematic for Spark SQL, because Spark SQL doesn't 
have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts 
`BINARY (ENUM)` to Catalyst `StringType`.  Thus, a predicate involving a 
`BINARY (ENUM)` column is treated as one involving a string field, and the 
query optimizer pushes it down.  Such predicates are perfectly legal, except 
that they fail the `ValidTypeMap` check.
    
    The workaround added here relaxes `ValidTypeMap` to include `BINARY 
(ENUM)`.  I also took the chance to simplify `ParquetCompatibilityTest` a 
little while adding a regression test.
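
    As a rough illustration of the idea (a toy model, not the parquet-mr code), the check can be pictured as a whitelist keyed by the predicate's value class, and the workaround as one extra entry in it. The names `VALID_TYPES`, `can_push_down`, and `relax_for_enum` are hypothetical, and the table contents are assumed for illustration only:

```python
# Toy model of a column-type whitelist for filter push-down.
# All names and table entries are illustrative assumptions,
# not the actual parquet-mr ValidTypeMap implementation.

# Predicate value class -> allowed (primitive type, logical annotation) pairs.
VALID_TYPES = {
    "Binary": {("BINARY", None), ("BINARY", "UTF8")},
    "Integer": {("INT32", None)},
}


def can_push_down(value_class, primitive, annotation):
    """Return True if a predicate with this value class may target the column."""
    return (primitive, annotation) in VALID_TYPES.get(value_class, set())


def relax_for_enum():
    """Mimic the workaround: also accept BINARY columns annotated as ENUM."""
    VALID_TYPES["Binary"].add(("BINARY", "ENUM"))


before = can_push_down("Binary", "BINARY", "ENUM")
relax_for_enum()
after = can_push_down("Binary", "BINARY", "ENUM")
print(before, after)  # False True
```

    In this sketch a string-valued predicate on an ENUM column is rejected before the relaxation and accepted after it, while all other entries are left untouched.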

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-9407/parquet-enum-filter-push-down

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8107.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8107
    
----
commit ed5b26c1069630105c32213581ba74caf881bf93
Author: Cheng Lian <[email protected]>
Date:   2015-08-11T15:30:17Z

    Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down

----

