[
https://issues.apache.org/jira/browse/IMPALA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610754#comment-16610754
]
Csaba Ringhofer commented on IMPALA-7559:
-----------------------------------------
The only solution I see at the moment is to widen the accepted interval when
min/max falls into an hour where the local->UTC mapping is ambiguous. The min
value should be set to the smaller possible UTC value and the max value to the
larger one. For example, the predicate
d = "2017-10-29 02:30:00" (CET)
should be transformed to the stat filter
"2017-10-29 00:30:00" <= d <= "2017-10-29 01:30:00" (UTC).
I think that CCTZ provides the necessary information, so the only problem is
the integration into Impala's Parquet reader logic.
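The widening described above can be sketched outside Impala as follows. This is only an illustration in Python using the standard zoneinfo module, not the CCTZ-based code the backend would need; the helper name utc_bounds is hypothetical, and "Europe/Berlin" is assumed here as a stand-in for the CET zone used in the reproduction.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def utc_bounds(local_naive, tz_name="Europe/Berlin"):
    """Return (min_utc, max_utc) for a naive local timestamp.

    Hypothetical helper: both interpretations of the local time are tried
    (fold=0 is the first occurrence, fold=1 the second). For an unambiguous
    local time both give the same UTC value, so the interval is a single point.
    """
    tz = ZoneInfo(tz_name)
    candidates = []
    for fold in (0, 1):
        aware = local_naive.replace(tzinfo=tz, fold=fold)
        # Convert to UTC and drop tzinfo so the results compare as plain values.
        candidates.append(aware.astimezone(timezone.utc).replace(tzinfo=None))
    return min(candidates), max(candidates)


# Ambiguous local time from the example in this comment.
lo, hi = utc_bounds(datetime(2017, 10, 29, 2, 30))
print(lo, "<= d <=", hi)
```

With this approach a predicate on an ambiguous local time yields a one-hour UTC interval, so a row group whose statistics match either interpretation is kept rather than filtered out.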
> Inconsistent parquet stat filtering of timestamps at dst change
> ---------------------------------------------------------------
>
> Key: IMPALA-7559
> URL: https://issues.apache.org/jira/browse/IMPALA-7559
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Csaba Ringhofer
> Priority: Minor
>
> If the min/max value of a timestamp column chunk falls within the hour of the
> summer->winter DST change (UTC+2 -> UTC+1 in CET), stat filtering can drop
> row groups that contain rows that would otherwise match the predicate.
> To reproduce (on current master branch):
> {code}
> 1. it is assumed that the timezone is CET and that flag
> convert_legacy_hive_parquet_utc_timestamps is enabled
> ( export TZ=CET; bin/start-impala-cluster.py
> --impalad_args="-convert_legacy_hive_parquet_utc_timestamps=true" )
> 2. create a table in Hive and fill data in 3 inserts to create 3 files:
> create table t (i int, d timestamp) stored as parquet;
> insert into t values (1, "2017-10-29 02:30:00"), (2, "2018-10-28 02:30:00");
> insert into t values (3, "2018-10-28 02:30:00");
> insert into t values (4, "2017-10-29 02:30:00");
> 3. Query from Impala
> set num_nodes=1;
> select * from t; -- returns all 4 values (same as Hive)
> select * from t where d = "2017-10-29 02:30:00"; -- returns 1 in Impala (Hive
> returns 1,4)
> select * from t where d = "2018-10-28 02:30:00"; -- returns 2 in Impala (Hive
> returns 2,3)
> profile; -- NumStatsFilteredRowGroups: 2 (only one row group should have been
> stat filtered)
> select * from t where d = "2018-10-28 02:30:00" or i = 5; -- returns 2 and 3
> in Impala (same as Hive), because the "or" part disabled stat filtering
> {code}