Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/805#discussion_r125784066
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -1008,8 +1008,24 @@ public void setMax(Object max) {
           return nulls;
         }
     
    -    @Override public boolean hasSingleValue() {
    -      return (max != null && min != null && max.equals(min));
    +    /**
    +     * Checks that the column chunk has single value.
    +     * Returns true if min and max are the same, but not null.
    +     * Returns true if min and max are null and the number of null values
    +     * in the column chunk is greater than 0.
    +     *
    +     * @return true if column has single value
    --- End diff --
    
    My understanding is that hasSingleValue() should return true only if the
    column metadata shows exactly one distinct value, where null counts as a
    value distinct from any non-null value.
    
    Therefore, when a column has min != null && max != null &&
    min.equals(max) && nulls != null && nulls > 0, it should return false.
    However, both the v1 and v3 implementations return true in that case.
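    
    To make the expected behavior concrete, here is a minimal sketch of the
    check as I read it. The class name and the standalone (min, max, nulls)
    signature are hypothetical stand-ins, not the actual Metadata.java code,
    and treating an unknown null count (nulls == null) as "not single-valued"
    is my own conservative assumption:
    
    ```java
    class SingleValueCheck {
      // Hypothetical stand-in for the column chunk statistics discussed above.
      // Returns true only when the chunk holds exactly one distinct value,
      // counting null as a value of its own.
      static boolean hasSingleValue(Object min, Object max, Long nulls) {
        if (min != null && max != null) {
          // One distinct non-null value, and no nulls mixed in.
          // Assumption: an unknown null count is treated as "not single".
          return min.equals(max) && nulls != null && nulls == 0;
        }
        // All-null chunk: no min/max recorded and at least one null present.
        return min == null && max == null && nulls != null && nulls > 0;
      }

      public static void main(String[] args) {
        System.out.println(hasSingleValue(100, 100, 0L));   // true
        System.out.println(hasSingleValue(100, 100, 5L));   // false: 100 plus nulls
        System.out.println(hasSingleValue(null, null, 5L)); // true: only nulls
        System.out.println(hasSingleValue(100, 200, 0L));   // false: two values
      }
    }
    ```
    
    The second call is the case that v1 and v3 get wrong.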
    
    That leads to wrong query results. A simple reproduction:
    
    ```
    create table dfs.tmp.`t5/a` as
      select 100 as mykey from cp.`tpch/nation.parquet`
      union all
      select col_notexist from cp.`tpch/region.parquet`;
    
    create table dfs.tmp.`t5/b` as
      select 200 as mykey from cp.`tpch/nation.parquet`
      union all
      select col_notexist from cp.`tpch/region.parquet`;
    ```
    
    This produces two files, each with a single distinct non-null value plus
    some null values. Now query the two files:
    
    ```
    select mykey from dfs.tmp.`t5` where mykey = 100;
    +--------+
    | mykey  |
    +--------+
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | null   |
    | null   |
    | null   |
    | null   |
    | null   |
    +--------+
    30 rows selected (0.246 seconds)
    
    ```
    Clearly, those 5 nulls should not be returned, since a null never satisfies mykey = 100.
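    
    The row counts can be checked with a small model. This is only a sketch:
    the class and method names are hypothetical, and the "keep or drop the
    whole file" path is my assumption about how a file flagged as
    single-valued ends up being handled, not something taken from the Drill
    code:
    
    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import java.util.Objects;

    class PruningRepro {
      // One file from the repro: 25 rows holding `key`, followed by 5 nulls.
      static List<Integer> file(int key) {
        List<Integer> rows = new ArrayList<>(Collections.nCopies(25, key));
        rows.addAll(Collections.nCopies(5, (Integer) null));
        return rows;
      }

      // Row-level evaluation of `mykey = target`; a null never matches.
      static long filterRows(List<List<Integer>> files, int target) {
        return files.stream()
            .flatMap(List::stream)
            .filter(v -> v != null && v == target)
            .count();
      }

      // Assumed model of the faulty path: a file whose non-null min equals its
      // max is treated as single-valued, so it is kept or dropped as a whole.
      static long pruneWholeFiles(List<List<Integer>> files, int target) {
        long count = 0;
        for (List<Integer> f : files) {
          int min = f.stream().filter(Objects::nonNull)
              .mapToInt(Integer::intValue).min().getAsInt();
          int max = f.stream().filter(Objects::nonNull)
              .mapToInt(Integer::intValue).max().getAsInt();
          if (min == max && min == target) {
            count += f.size(); // the nulls ride along with the kept file
          }
        }
        return count;
      }

      public static void main(String[] args) {
        List<List<Integer>> files = Arrays.asList(file(100), file(200));
        System.out.println(filterRows(files, 100));      // 25
        System.out.println(pruneWholeFiles(files, 100)); // 30
      }
    }
    ```
    
    The 30 from the second call matches the "30 rows selected" above, while
    the row-level filter gives the expected 25.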
    
    I applied the 3 commits in this PR on top of today's master branch.
    
    ```
    select * from sys.version;
    
    +------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+-----------------+----------------------------+
    |     version      |                 commit_id                 |                                commit_message                                 |        commit_time         |   build_email   |         build_time         |
    +------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+-----------------+----------------------------+
    | 1.11.0-SNAPSHOT  | cad6e4dc950aa4a95ad20515ce5abd9c546d3e5d  | DRILL-4139: Fix loss of scale value for DECIMAL in parquet partition pruning  | 05.07.2017 @ 12:05:25 PDT  | j...@apache.org  | 05.07.2017 @ 12:06:07 PDT  |
    +------------------+-------------------------------------------+-----
    ```

