[
https://issues.apache.org/jira/browse/IMPALA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610818#comment-16610818
]
Csaba Ringhofer commented on IMPALA-7559:
-----------------------------------------
It turns out that this problem affects any timestamp when
convert_legacy_hive_parquet_utc_timestamps=true, not just timestamps that fall
within a DST change. The cause is that convert_legacy_hive_parquet_utc_timestamps
is not considered at all during stat filtering, so the predicate and the Parquet
stats are assumed to use the same time zone.
This makes the issue much more serious (that is, more frequent) than I thought:
timestamps in the first / last hours of a column chunk will not be returned if
there is an EQ predicate on them.
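
The mismatch described above can be illustrated with a minimal sketch. This is not Impala code: the function names are hypothetical, and a fixed UTC+2 offset is assumed in place of real CET/CEST rules, so the DST handling itself is deliberately left out. It only shows how comparing a local-time literal against UTC min/max stats can skip a row group that contains a matching row.

```python
from datetime import datetime, timedelta

# Assumption: a fixed UTC+2 offset stands in for CET summer time.
# Real time zone rules (including DST transitions) are not modelled here.
LOCAL_OFFSET = timedelta(hours=2)


def utc_to_local(ts):
    """Convert a naive UTC timestamp to the assumed local time."""
    return ts + LOCAL_OFFSET


def naive_can_skip(stat_min_utc, stat_max_utc, literal_local):
    """EQ-predicate min/max check that ignores the time zone mismatch."""
    return not (stat_min_utc <= literal_local <= stat_max_utc)


def converted_can_skip(stat_min_utc, stat_max_utc, literal_local):
    """Same check after converting the stats to local time first."""
    lo = utc_to_local(stat_min_utc)
    hi = utc_to_local(stat_max_utc)
    return not (lo <= literal_local <= hi)


# Column chunk whose stats are stored in UTC (legacy Hive behaviour).
chunk_min_utc = datetime(2018, 7, 1, 8, 0, 0)
chunk_max_utc = datetime(2018, 7, 1, 20, 0, 0)

# Predicate literal in local time; the matching row is stored as 19:30 UTC,
# which lies inside the chunk.
literal_local = datetime(2018, 7, 1, 21, 30, 0)

naive_skip = naive_can_skip(chunk_min_utc, chunk_max_utc, literal_local)
converted_skip = converted_can_skip(chunk_min_utc, chunk_max_utc, literal_local)

print("naive check skips row group:    ", naive_skip)
print("converted check skips row group:", converted_skip)
```

With a positive offset the rows lost are those in the last hours of the chunk; with a negative offset it would be the first hours.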
> Inconsistent parquet stat filtering of timestamps at dst change
> ---------------------------------------------------------------
>
> Key: IMPALA-7559
> URL: https://issues.apache.org/jira/browse/IMPALA-7559
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Csaba Ringhofer
> Priority: Minor
> Labels: correctness, parquet, wrongresults
>
> If the min/max value of a timestamp column chunk falls within the hour of the
> Summer->Winter DST change (UTC+2 -> UTC+1 in CET), then stat filtering can
> drop row groups that contain rows that would otherwise satisfy the
> predicate.
> To reproduce (on current master branch):
> {code}
> 1. it is assumed that the timezone is CET and that flag
> convert_legacy_hive_parquet_utc_timestamps is enabled
> ( export TZ=CET; bin/start-impala-cluster.py
> --impalad_args="-convert_legacy_hive_parquet_utc_timestamps=true" )
> 2. create a table in hive and fill data in 3 inserts to create 3 files:
> create table t (i int, d timestamp) stored as parquet;
> insert into t values (1, "2017-10-29 02:30:00"), (2, "2018-10-28 02:30:00");
> insert into t values (3, "2018-10-28 02:30:00");
> insert into t values (4, "2017-10-29 02:30:00");
> 3. Query from Impala
> set num_nodes=1;
> select * from t; -- returns all 4 values (same as Hive)
> select * from t where d = "2017-10-29 02:30:00"; -- returns 1 in Impala (Hive
> returns 1,4)
> select * from t where d = "2018-10-28 02:30:00"; -- returns 2 in Impala (Hive
> returns 2,3)
> profile; -- NumStatsFilteredRowGroups: 2 (only one row group should have been
> stat filtered)
> select * from t where d = "2018-10-28 02:30:00" or i = 5; -- returns 2 and 3
> in Impala (same as Hive), because the "or" part disabled stat filtering
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)