GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/14067

    [SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet

    ## What changes were proposed in this pull request?
    
    Currently, if there is a schema as below:
    
    ```
    root
      |-- _1: struct (nullable = true)
      |    |-- _1: string (nullable = true)
    ```
    
    and if we execute the code below:
    
    ```scala
    df.filter("_1 IS NOT NULL").count()
    ```
    
    then Spark pushes down a filter even though the filter is applied to a `StructType` column. (If my understanding is correct, Spark does not push down filters for struct columns.)
    
    The reason is that `ParquetFilters.getFieldMap` produces the results below:
    
    ```
    (_1,StructType(StructField(_1,IntegerType,true)))
    (_1,IntegerType)
    ```
    
    and these pairs are then converted to a `Map`, where the later entry overwrites the earlier one with the same key:
    
    ```
    (_1,IntegerType)
    ```
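    
    As a rough illustration of the collapse above, here is a minimal sketch that runs without Spark. The `DataType`, `StructField` and `StructType` definitions and the `flatten` helper are hypothetical stand-ins written for this example, not the actual `ParquetFilters` code; the only behaviour relied on is that Scala's `toMap` keeps the last value for a duplicated key.
    
    ```scala
    // Hypothetical stand-ins for Spark's data types, so the snippet runs without Spark.
    sealed trait DataType
    case object IntegerType extends DataType
    final case class StructField(name: String, dataType: DataType)
    final case class StructType(fields: Seq[StructField]) extends DataType
    
    // Collect (name, type) pairs for the fields at this level, then recurse into nested structs.
    def flatten(dt: DataType): Seq[(String, DataType)] = dt match {
      case StructType(fields) =>
        fields.map(f => f.name -> f.dataType) ++ fields.flatMap(f => flatten(f.dataType))
      case _ => Seq.empty
    }
    
    // Outer `_1` is a struct that contains an inner `_1`.
    val schema = StructType(Seq(
      StructField("_1", StructType(Seq(StructField("_1", IntegerType))))))
    
    val pairs = flatten(schema)   // two pairs, both keyed by "_1"
    val fieldMap = pairs.toMap    // the later (inner) entry wins
    
    println(fieldMap)             // Map(_1 -> IntegerType)
    ```
    
    The outer struct entry is silently lost, so any later lookup by the name `_1` sees only the inner type.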
    
    Now, because of `....lift(dataTypeOf(name)).map(_(name, value))`, a filter is pushed down for `_1`, which is looked up as `IntegerType`. However, the column is actually a `StructType`.
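    
    To make the `lift` step concrete, here is a small sketch under simplified assumptions. The `makeEq` partial function, the `pushDown` helper and the string "predicates" are hypothetical and only mimic the shape of the expression above. `PartialFunction.lift` returns `Some` when the function is defined for the looked-up type and `None` otherwise, so a wrong entry in the map directly decides whether a filter is built.
    
    ```scala
    // Hypothetical, simplified types; the struct is modelled as a plain case object here.
    sealed trait DataType
    case object IntegerType extends DataType
    case object StructType extends DataType
    
    // Defined only for types that support push-down (just IntegerType in this sketch).
    val makeEq: PartialFunction[DataType, (String, Any) => String] = {
      case IntegerType => (name: String, value: Any) => s"eq($name, $value)"
    }
    
    // Same shape as `....lift(dataTypeOf(name)).map(_(name, value))`.
    def pushDown(dataTypeOf: Map[String, DataType], name: String, value: Any): Option[String] =
      makeEq.lift(dataTypeOf(name)).map(_(name, value))
    
    val collapsed = Map[String, DataType]("_1" -> IntegerType) // what the collapsed map reports
    val actual    = Map[String, DataType]("_1" -> StructType)  // what the column really is
    
    println(pushDown(collapsed, "_1", 1)) // Some(eq(_1, 1)): a filter is built for a struct column
    println(pushDown(actual, "_1", 1))    // None: no filter is pushed down
    ```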
    
    So the filter evaluated by Parquet filter2 fails, and then
    
    ```
    df.filter("_1 IS NOT NULL").count()
    ```
    
    produces 0.
    
    
    This PR prevents this by checking that the type is supported before pushing a filter down to Parquet.
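    
    One possible shape for such a pre-check is sketched below. This is an assumption for illustration only and not necessarily how the patch implements it: the types, `isSupported` and `pushableFields` are hypothetical names, and the sketch simply keeps root-level fields whose type is in a supported set, so a struct column never enters the lookup map.
    
    ```scala
    // Hypothetical stand-ins for Spark's data types, so the snippet runs without Spark.
    sealed trait DataType
    case object IntegerType extends DataType
    case object StringType extends DataType
    final case class StructField(name: String, dataType: DataType)
    final case class StructType(fields: Seq[StructField]) extends DataType
    
    // Assumed set of types that are safe to push down in this sketch.
    def isSupported(dt: DataType): Boolean = dt match {
      case IntegerType | StringType => true
      case _ => false
    }
    
    // Keep only root-level fields with a supported type; nested fields are not flattened.
    def pushableFields(schema: StructType): Map[String, DataType] =
      schema.fields.collect { case StructField(n, dt) if isSupported(dt) => n -> dt }.toMap
    
    val nested = StructType(Seq(
      StructField("_1", StructType(Seq(StructField("_1", IntegerType))))))
    val flat = StructType(Seq(StructField("a", IntegerType)))
    
    println(pushableFields(nested).get("_1")) // None: no filter for the struct column
    println(pushableFields(flat).get("a"))    // Some(IntegerType)
    ```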
    
    ## How was this patch tested?
    
    Unit test in `ParquetFilterSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-16371

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14067
    
----
commit 54b834838c1855fcea0d108b19b89119fe351618
Author: hyukjinkwon <[email protected]>
Date:   2016-07-06T10:00:53Z

    Do not push down filters incorrectly when inner name and outer name are the same

----

