Csaba Ringhofer created IMPALA-7567:
---------------------------------------
Summary: Implement timezone aware parquet stat filtering for
timestamp columns
Key: IMPALA-7567
URL: https://issues.apache.org/jira/browse/IMPALA-7567
Project: IMPALA
Issue Type: Bug
Reporter: Csaba Ringhofer
Parquet timestamp columns can contain UTC normalized data, which means that the
data is stored in UTC but it is expected to be shown in local time (to be
consistent with Hive). This is done by converting these timestamps from UTC to
local time during scanning.
This conversion has to be considered during min/max stat filtering, otherwise
some row groups can be incorrectly skipped. For this reason IMPALA-7559
disables stat filtering on UTC normalized timestamp columns.
This ticket deals with creating a correct implementation to be able to
re-enable stat filtering for these columns.
DST and historical rule changes add some complexity to this. UTC->local mapping
can be non-monotonic, and local->UTC mapping can be ambiguous. The
non-monotonic mapping means that tMin <= t <= tMax being true in UTC does not
imply that the same is true in local time.
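A minimal sketch of the non-monotonic case, using a toy zone with a single
fall-back transition. The transition instant and offsets are modeled on the
CEST -> CET change on 2018-10-28 at 01:00 UTC; the names FALL_BACK_UTC,
utc_to_local, t_min, t and t_max are illustrative assumptions and are not
taken from Impala:

```python
from datetime import datetime, timedelta

# Toy zone (assumption): UTC+2 before the fall-back instant, UTC+1 from it on.
FALL_BACK_UTC = datetime(2018, 10, 28, 1, 0)
OFFSET_BEFORE = timedelta(hours=2)
OFFSET_AFTER = timedelta(hours=1)


def utc_to_local(utc):
    """Convert a naive UTC timestamp to a naive local timestamp."""
    offset = OFFSET_BEFORE if utc < FALL_BACK_UTC else OFFSET_AFTER
    return utc + offset


# Hypothetical row group stats and one row value, all stored in UTC.
t_min = datetime(2018, 10, 28, 0, 45)
t = datetime(2018, 10, 28, 0, 50)
t_max = datetime(2018, 10, 28, 1, 15)

# The ordering holds in UTC ...
in_range_utc = t_min <= t <= t_max

# ... but not after converting each value to local time, because the
# converted max (02:15) falls before the converted min (02:45).
in_range_local = utc_to_local(t_min) <= utc_to_local(t) <= utc_to_local(t_max)

print(in_range_utc, in_range_local)
```

Converting the stats themselves to local time therefore yields an inverted
range here, which is why a row group holding t could be skipped by mistake.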
The solution I see is to convert min/max of the predicate from local to UTC and
resolve ambiguity by choosing the earlier time in case of min, and the later
time in case of max. These UTC values can be compared with stats safely.
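The proposal above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not Impala code: the toy zone has a
single fall-back transition (modeled on CEST -> CET on 2018-10-28 at 01:00
UTC), local times skipped by a spring-forward transition are not handled, and
local_to_utc_candidates and may_contain_match are hypothetical helper names:

```python
from datetime import datetime, timedelta

# Toy zone (assumption): UTC+2 before the fall-back instant, UTC+1 from it on.
FALL_BACK_UTC = datetime(2018, 10, 28, 1, 0)
OFFSET_BEFORE = timedelta(hours=2)
OFFSET_AFTER = timedelta(hours=1)


def utc_to_local(utc):
    """Convert a naive UTC timestamp to a naive local timestamp."""
    offset = OFFSET_BEFORE if utc < FALL_BACK_UTC else OFFSET_AFTER
    return utc + offset


def local_to_utc_candidates(local):
    """Return every UTC instant that maps to `local`, sorted ascending.

    An ambiguous local time (repeated during fall-back) yields two entries.
    """
    candidates = []
    for offset in (OFFSET_BEFORE, OFFSET_AFTER):
        utc = local - offset
        if utc_to_local(utc) == local:
            candidates.append(utc)
    return sorted(candidates)


def may_contain_match(stat_min_utc, stat_max_utc, pred_min_local, pred_max_local):
    """Decide whether a row group may hold rows in the local-time range.

    The predicate bounds are converted to UTC: the earliest candidate is used
    for the lower bound and the latest candidate for the upper bound, so the
    UTC range can only be wider than necessary, never narrower.
    """
    lo_utc = local_to_utc_candidates(pred_min_local)[0]
    hi_utc = local_to_utc_candidates(pred_max_local)[-1]
    return stat_max_utc >= lo_utc and stat_min_utc <= hi_utc


# Predicate: local time BETWEEN 02:50 AND 02:55 (both bounds are ambiguous).
pred_min = datetime(2018, 10, 28, 2, 50)
pred_max = datetime(2018, 10, 28, 2, 55)

# Row group spanning the transition; it holds a row at 00:50 UTC = 02:50 local.
keep = may_contain_match(
    datetime(2018, 10, 28, 0, 45), datetime(2018, 10, 28, 1, 15), pred_min, pred_max
)

# Row group entirely after the widened UTC range; it can be skipped.
skip = not may_contain_match(
    datetime(2018, 10, 28, 3, 0), datetime(2018, 10, 28, 4, 0), pred_min, pred_max
)

print(keep, skip)
```

Widening the converted range can keep some row groups that contain no
matching rows, but in this sketch it does not drop one that does.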
Note that the timezone rules can be different in Hive and Impala (especially
historical ones), so we cannot ensure that Impala gives exactly the same
results as Hive. The goal is to ensure that Impala returns the same rows with
and without stat filtering.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)