[jira] [Updated] (IMPALA-7559) Parquet stat filtering ignores convert_legacy_hive_parquet_utc_timestamps

Csaba Ringhofer (JIRA) Tue, 11 Sep 2018 08:52:28 -0700


     [ https://issues.apache.org/jira/browse/IMPALA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Csaba Ringhofer updated IMPALA-7559:
------------------------------------
    Description: 
UPDATE: the issue turned out to be different from what I first thought; see my 
last comment. I will update the description with more details later.

If the min/max value of a timestamp column chunk falls within the hour of the 
Summer->Winter DST change (UTC+2 -> UTC+1 in CET), stat filtering can drop 
row groups that contain rows matching the predicate.
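
To illustrate why this hour is special: on the night of the change, a wall-clock time such as 02:30 occurs twice, once at UTC+2 and once at UTC+1. The following is a minimal Python sketch, assuming hard-coded fixed offsets instead of a real time zone database. The names CEST, CET_WINTER and to_local_wall_clock are illustrative only and are not Impala code.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, hard-coded for illustration (assumption: no tz database lookup).
CEST = timezone(timedelta(hours=2))        # summer time, UTC+2
CET_WINTER = timezone(timedelta(hours=1))  # winter time, UTC+1


def to_local_wall_clock(utc_instant, offset):
    """Return the naive local wall-clock time of a UTC instant at a fixed offset."""
    return utc_instant.astimezone(offset).replace(tzinfo=None)


# Two distinct UTC instants on the night of the 2017 Summer->Winter change.
before_change = datetime(2017, 10, 29, 0, 30, tzinfo=timezone.utc)  # still UTC+2
after_change = datetime(2017, 10, 29, 1, 30, tzinfo=timezone.utc)   # already UTC+1

local_a = to_local_wall_clock(before_change, CEST)
local_b = to_local_wall_clock(after_change, CET_WINTER)

# Both instants are displayed as the same local wall-clock time.
print(local_a, local_b)
```

Because both instants map to the same local string, converting "2017-10-29 02:30:00" back to UTC is ambiguous, so any code path that does the conversion has to pick one of the two instants.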

To reproduce (on current master branch):
{code}
1. It is assumed that the timezone is CET and that the flag 
convert_legacy_hive_parquet_utc_timestamps is enabled
( export TZ=CET; bin/start-impala-cluster.py 
--impalad_args="-convert_legacy_hive_parquet_utc_timestamps=true" )
2. Create a table in Hive and fill it with 3 inserts to create 3 files:
create table t (i int, d timestamp) stored as parquet;
insert into t values (1, "2017-10-29 02:30:00"), (2, "2018-10-28 02:30:00");
insert into t values (3, "2018-10-28 02:30:00");
insert into t values (4, "2017-10-29 02:30:00");
3. Query from Impala:
set num_nodes=1;
select * from t; -- returns all 4 rows (same as Hive)
select * from t where d = "2017-10-29 02:30:00";
-- Impala returns row 1 (Hive returns rows 1 and 4)
select * from t where d = "2018-10-28 02:30:00";
-- Impala returns row 2 (Hive returns rows 2 and 3)
profile;
-- NumStatsFilteredRowGroups: 2 (only one row group should have been stat filtered)
select * from t where d = "2018-10-28 02:30:00" or i = 5;
-- Impala returns rows 2 and 3 (same as Hive), because the "or" part disabled stat filtering
{code}
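
For readers unfamiliar with stat filtering, the sketch below shows the general min/max pruning idea in Python, and how a row group could be dropped incorrectly if the statistics and the predicate literal were in different time representations. This is only an illustration consistent with the issue title, not a confirmed root cause (the UPDATE note above says the actual issue differs from the first analysis). The function may_contain, the single-row row group, and the UTC+2 offset are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def may_contain(min_val, max_val, value):
    """Min/max pruning for an equality predicate: a row group can only
    hold matching rows if the value lies inside [min_val, max_val]."""
    return min_val <= value <= max_val


# Hypothetical row group holding a single row; stats assumed to be stored in UTC.
stored_utc = datetime(2017, 10, 29, 0, 30)
stats_min = stats_max = stored_utc

# The query literal is a local wall-clock time (UTC+2 assumed for this row).
predicate_local = datetime(2017, 10, 29, 2, 30)
assumed_offset = timedelta(hours=2)

# Comparing the local literal against unconverted UTC stats prunes the row group...
pruned_raw = not may_contain(stats_min, stats_max, predicate_local)
# ...while shifting the stats into local time first keeps it.
pruned_converted = not may_contain(
    stats_min + assumed_offset, stats_max + assumed_offset, predicate_local
)

print(pruned_raw, pruned_converted)
```

In this sketch the row group is pruned when the statistics are compared without conversion and kept once they are converted, which is the kind of discrepancy the NumStatsFilteredRowGroups counter in the profile would expose.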


> Parquet stat filtering ignores convert_legacy_hive_parquet_utc_timestamps
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-7559
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7559
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: correctness, parquet, wrongresults


