Csaba Ringhofer created IMPALA-7567:
---------------------------------------

             Summary: Implement timezone aware parquet stat filtering for timestamp columns
                 Key: IMPALA-7567
                 URL: https://issues.apache.org/jira/browse/IMPALA-7567
             Project: IMPALA
          Issue Type: Bug
            Reporter: Csaba Ringhofer


Parquet timestamp columns can contain UTC-normalized data, which means that the 
data is stored in UTC but is expected to be shown in local time (to be 
consistent with Hive). This is done by converting these timestamps from UTC to 
local time during scanning.

This conversion has to be considered during min/max stat filtering, otherwise 
some row groups can be incorrectly skipped. For this reason, IMPALA-7559 
disables stat filtering on UTC-normalized timestamp columns.

This ticket deals with creating a correct implementation to be able to re-enable 
stat filtering for these columns.

DST and historical rule changes add some complexity to this. The UTC->local 
mapping can be non-monotonic, and the local->UTC mapping can be ambiguous. The 
non-monotonic mapping means that tMin <= t <= tMax being true in UTC does not 
imply that the same is true in local time.
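
To make the non-monotonic case concrete, here is a minimal sketch. It does not 
use a real timezone database: the single hard-coded fall-back transition 
(offset +2h before 2018-10-28 01:00 UTC, +1h after, mirroring the EU rule for 
that date) and the helper name utc_to_local are assumptions made only for 
illustration.

```python
from datetime import datetime, timedelta

# Assumption: a toy zone with exactly one fall-back transition. The offset
# drops from +2h to +1h at 01:00 UTC on 2018-10-28. Real code would consult
# a timezone database instead of a hard-coded rule.
TRANSITION_UTC = datetime(2018, 10, 28, 1, 0)


def utc_to_local(t_utc):
    """Convert a naive UTC datetime to naive local time in the toy zone."""
    offset = timedelta(hours=2) if t_utc < TRANSITION_UTC else timedelta(hours=1)
    return t_utc + offset


# Three UTC instants with t_min <= t <= t_max.
t_min = datetime(2018, 10, 28, 0, 30)
t = datetime(2018, 10, 28, 0, 45)
t_max = datetime(2018, 10, 28, 1, 15)

local_min = utc_to_local(t_min)  # 02:30 local (before the transition)
local_t = utc_to_local(t)        # 02:45 local (before the transition)
local_max = utc_to_local(t_max)  # 02:15 local (after the transition)

in_utc_range = t_min <= t <= t_max                   # True
in_local_range = local_min <= local_t <= local_max   # False: order is broken
```

Here tMax maps to an earlier local time than tMin, so a range check that holds 
in UTC fails once all three values are converted to local time.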

The solution I see is to convert the min/max of the predicate from local to UTC 
and resolve ambiguity by choosing the earlier time in the case of min, and the 
later time in the case of max. These UTC values can be compared with the stats 
safely.
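
The proposal could be sketched roughly as follows. This is not Impala code: it 
reuses a toy zone with one hard-coded fall-back transition (+2h before 
2018-10-28 01:00 UTC, +1h after), and the names local_to_utc_candidates and 
can_skip_row_group are hypothetical. The sketch only models the ambiguous 
(repeated) hour; local times skipped by a spring-forward transition are not 
handled.

```python
from datetime import datetime, timedelta

# Assumption: toy zone with a single fall-back transition, not a real tz db.
TRANSITION_UTC = datetime(2018, 10, 28, 1, 0)
BEFORE, AFTER = timedelta(hours=2), timedelta(hours=1)


def utc_to_local(t_utc):
    """Convert a naive UTC datetime to naive local time in the toy zone."""
    return t_utc + (BEFORE if t_utc < TRANSITION_UTC else AFTER)


def local_to_utc_candidates(t_local):
    """Return every UTC instant that is displayed as t_local.

    During the repeated local hour there are two candidates, otherwise one.
    """
    candidates = []
    if t_local - BEFORE < TRANSITION_UTC:
        candidates.append(t_local - BEFORE)
    if t_local - AFTER >= TRANSITION_UTC:
        candidates.append(t_local - AFTER)
    return candidates


def can_skip_row_group(stat_min_utc, stat_max_utc, pred_min_local, pred_max_local):
    """Decide whether a row group can be skipped for
    `ts BETWEEN pred_min_local AND pred_max_local`.

    The predicate bounds are moved to UTC: earliest candidate for the lower
    bound, latest candidate for the upper bound. The stats stay in UTC.
    """
    pred_min_utc = min(local_to_utc_candidates(pred_min_local))
    pred_max_utc = max(local_to_utc_candidates(pred_max_local))
    return stat_max_utc < pred_min_utc or stat_min_utc > pred_max_utc


# Row group whose UTC stats straddle the transition.
stat_min = datetime(2018, 10, 28, 0, 30)
stat_max = datetime(2018, 10, 28, 1, 15)
# Predicate in local time; a row stored at 00:45 UTC shows as 02:45 and matches.
pred_min = datetime(2018, 10, 28, 2, 40)
pred_max = datetime(2018, 10, 28, 2, 50)

# Naive approach: convert the stats to local time and compare there.
naive_skip = utc_to_local(stat_max) < pred_min or utc_to_local(stat_min) > pred_max
# Proposed approach: convert the predicate to UTC and compare with raw stats.
safe_skip = can_skip_row_group(stat_min, stat_max, pred_min, pred_max)
```

In this example naive_skip is True, which would drop a matching row, while 
safe_skip is False. The widened UTC range may keep some row groups that contain 
no matching rows, but that only costs extra scanning and does not change the 
result.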

Note that the timezone rules can be different in Hive and Impala (especially 
historical ones), so we cannot ensure that Impala gives exactly the same 
results as Hive. The goal is to ensure that Impala returns the same rows with 
and without stat filtering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
