[ 
https://issues.apache.org/jira/browse/IMPALA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615026#comment-16615026
 ] 

ASF subversion and git services commented on IMPALA-7559:
---------------------------------------------------------

Commit eb3108c461523915b20f6686d7aabcaddfe79114 in impala's branch 
refs/heads/master from [~csringhofer]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=eb3108c ]

IMPALA-7559: Disable stat filtering for UTC-normalized timestamp columns

If convert_legacy_hive_parquet_utc_timestamps=true and the Parquet
file was written by parquet-mr (also used by Hive), then timestamps
are converted from UTC to local time during scanning. Stat filtering
did not handle this case correctly: it compared UTC min/max values
from the stats with local-time values from predicates. This could
lead to skipping row groups incorrectly.
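
The mismatch can be illustrated with a minimal Python sketch. This is not
Impala's implementation: the helper can_skip_row_group and the fixed UTC+2
offset are assumptions made for illustration, standing in for the real
min/max check and the real timezone conversion.

```python
from datetime import datetime, timedelta

# Assumption: a fixed UTC+2 offset stands in for the real UTC -> local conversion.
LOCAL_OFFSET = timedelta(hours=2)


def can_skip_row_group(stat_min, stat_max, predicate_value):
    """For an equality predicate, a row group can be skipped when the
    value lies outside the [stat_min, stat_max] range of the column chunk."""
    return predicate_value < stat_min or predicate_value > stat_max


# Column chunk holding a single value, stored UTC-normalized by the writer.
stored_utc = datetime(2018, 10, 28, 0, 30)
stat_min = stat_max = stored_utc

# Predicate literal as the user wrote it, in local time.
predicate_local = datetime(2018, 10, 28, 2, 30)

# Inconsistent comparison: UTC stats against a local-time predicate value.
buggy_skip = can_skip_row_group(stat_min, stat_max, predicate_local)

# Consistent comparison: stats converted to local time before comparing.
consistent_skip = can_skip_row_group(
    stat_min + LOCAL_OFFSET, stat_max + LOCAL_OFFSET, predicate_local
)

print(buggy_skip, consistent_skip)
```

In the sketch the inconsistent comparison decides to skip a row group that
actually contains the matching row, while the consistent one keeps it.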

Note that parquet-mr only writes stats if min and max are equal,
because it cannot order timestamps correctly, so the only case
affected here is when every value is the same in the column chunk.

It would be possible to implement stat filtering correctly, but
this is non-trivial because of DST and historical timezone rule
changes.
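
A small Python sketch of why DST makes this hard: converting the UTC min and
max to local time does not necessarily preserve their order. The function
utc_to_local below is hypothetical and hard-codes the CET transition of
2018-10-28 (01:00 UTC), so it is only meaningful around that instant.

```python
from datetime import datetime, timedelta

# Assumption: in CET, DST ended at 2018-10-28 01:00 UTC (UTC+2 -> UTC+1).
DST_END_UTC = datetime(2018, 10, 28, 1, 0)


def utc_to_local(utc):
    """Hypothetical conversion that is only valid near the transition above."""
    offset = timedelta(hours=2) if utc < DST_END_UTC else timedelta(hours=1)
    return utc + offset


# Hypothetical UTC min/max stats that straddle the DST change.
utc_min = datetime(2018, 10, 28, 0, 45)
utc_max = datetime(2018, 10, 28, 1, 15)

local_from_min = utc_to_local(utc_min)  # 02:45 local (still UTC+2)
local_from_max = utc_to_local(utc_max)  # 02:15 local (already UTC+1)

# The converted "min" ends up later than the converted "max".
order_preserved = local_from_min <= local_from_max
print(local_from_min, local_from_max, order_preserved)
```

Because the converted endpoints can swap order, a correct implementation
cannot simply convert min and max independently and compare as before.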

Testing:
- added a Hive-generated Parquet file and a custom cluster test
  that reproduces this issue

Change-Id: Id4c02230993f2390c03d513f08bae2e9d3d538fa
Reviewed-on: http://gerrit.cloudera.org:8080/11431
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Parquet stat filtering ignores convert_legacy_hive_parquet_utc_timestamps
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-7559
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7559
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Blocker
>              Labels: correctness, parquet, wrongresults
>             Fix For: Impala 3.1.0
>
>
> UPDATE: the issue turned out to be different from what I first thought; see
> my last comment. I will update the description with more details later.
> If the min/max value of a timestamp column chunk falls within the hour of the
> summer-to-winter DST change (UTC+2 -> UTC+1 in CET), then stat filtering can
> drop row groups that contain rows that would otherwise match the predicate.
> To reproduce (on current master branch):
> {code}
> 1. it is assumed that the timezone is CET and that flag 
> convert_legacy_hive_parquet_utc_timestamps is enabled
> ( export TZ=CET; bin/start-impala-cluster.py 
> --impalad_args="-convert_legacy_hive_parquet_utc_timestamps=true" )
> 2. create a table in hive and fill data in 3 inserts to create 3 files:
> create table t (i int, d timestamp) stored as parquet;
> insert into t values (1, "2017-10-29 02:30:00"), (2, "2018-10-28 02:30:00");
> insert into t values (3, "2018-10-28 02:30:00");
> insert into t values (4, "2017-10-29 02:30:00");
> 3. Query from Impala
> set num_nodes=1;
> select * from t; -- returns all 4 values (same as Hive) 
> select * from t where d = "2017-10-29 02:30:00"; -- returns 1 in Impala
> -- (Hive returns 1,4)
> select * from t where d = "2018-10-28 02:30:00"; -- returns 2 in Impala
> -- (Hive returns 2,3)
> profile; -- NumStatsFilteredRowGroups: 2 (only one row group should have
> -- been stat filtered)
> select * from t where d = "2018-10-28 02:30:00" or i = 5; -- returns 2 and 3
> -- in Impala (same as Hive), because the "or" part disabled stat filtering
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
