[jira] [Commented] (IMPALA-7559) Inconsistent parquet stat filtering of timestamps at dst change

Csaba Ringhofer (JIRA) Tue, 11 Sep 2018 08:47:26 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610818#comment-16610818
 ]


Csaba Ringhofer commented on IMPALA-7559:
-----------------------------------------

It turned out that this problem affects any timestamp when 
convert_legacy_hive_parquet_utc_timestamps=true, not just those that fall in a 
DST change. The cause is that convert_legacy_hive_parquet_utc_timestamps is not 
considered at all during stat filtering, so the predicate and the Parquet stats 
are assumed to use the same time zone.

This makes the issue much more serious (that is, more frequent) than I thought: 
timestamps in the first or last hours of a column chunk will not be returned if 
there is an EQ predicate on them.
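
The mismatch can be illustrated with a small model. This is only a sketch, not 
Impala's code: the function names are made up for illustration, and a fixed 
UTC+1 offset stands in for the local time zone (CET in winter). The stats hold 
the stored UTC values, while the predicate literal is in local time.

```python
from datetime import datetime, timedelta

# Assumption: a fixed UTC+1 offset stands in for the local zone (CET, winter).
LOCAL_OFFSET = timedelta(hours=1)


def utc_to_local(ts):
    """Convert a stored UTC timestamp to local time (simplified, no DST)."""
    return ts + LOCAL_OFFSET


def may_contain_naive(stat_min, stat_max, literal):
    """EQ stat check that compares the local literal with the raw UTC stats."""
    return stat_min <= literal <= stat_max


def may_contain_converted(stat_min, stat_max, literal):
    """EQ stat check that first converts the stats to the literal's zone."""
    return utc_to_local(stat_min) <= literal <= utc_to_local(stat_max)


# Values as stored in the file (UTC); the stats are their min and max.
stored_utc = [datetime(2018, 1, 14, 23, 30), datetime(2018, 1, 15, 10, 0)]
stat_min, stat_max = min(stored_utc), max(stored_utc)

# The last row reads back as 11:00 local time, so the query uses that literal.
literal = utc_to_local(stored_utc[-1])

skipped_naive = not may_contain_naive(stat_min, stat_max, literal)
skipped_converted = not may_contain_converted(stat_min, stat_max, literal)
print(skipped_naive, skipped_converted)
```

In this model the naive check skips the row group even though it holds a 
matching row, because the local literal lies one hour past the UTC max. With a 
negative UTC offset the same thing would happen at the min end instead.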

> Inconsistent parquet stat filtering of timestamps at dst change
> ---------------------------------------------------------------
>
>                 Key: IMPALA-7559
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7559
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Minor
>              Labels: correctness, parquet, wrongresults
>
> If the min/max value of a timestamp column chunk falls in the hour of the 
> Summer->Winter DST change (UTC+2 -> UTC+1 in CET), then stat filtering can 
> drop row groups that contain rows that would otherwise satisfy the 
> predicate.
> To reproduce (on current master branch):
> {code}
> 1. it is assumed that the timezone is CET and that flag 
> convert_legacy_hive_parquet_utc_timestamps is enabled
> ( export TZ=CET; bin/start-impala-cluster.py 
> --impalad_args="-convert_legacy_hive_parquet_utc_timestamps=true" )
> 2. create a table in Hive and fill data in 3 inserts to create 3 files:
> create table t (i int, d timestamp) stored as parquet;
> insert into t values (1, "2017-10-29 02:30:00"), (2, "2018-10-28 02:30:00");
> insert into t values (3, "2018-10-28 02:30:00");
> insert into t values (4, "2017-10-29 02:30:00");
> 3. Query from Impala
> set num_nodes=1;
> select * from t; -- returns all 4 values (same as Hive) 
> select * from t where d = "2017-10-29 02:30:00"; -- returns 1 in Impala (Hive 
> returns 1,4)
> select * from t where d = "2018-10-28 02:30:00"; -- returns 2 in Impala (Hive 
> returns 2,3)
> profile; -- NumStatsFilteredRowGroups: 2 (only one row group should have been 
> stat filtered)
> select * from t where d = "2018-10-28 02:30:00" or i = 5; -- returns 2 and 3 
> in Impala (same as Hive), because the "or" part disabled stat filtering
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

