Rei Mai created ORC-451:
---------------------------
Summary: Timestamp statistics is wrong if read with
useUTCTimestamp=true
Key: ORC-451
URL: https://issues.apache.org/jira/browse/ORC-451
Project: ORC
Issue Type: Bug
Affects Versions: 1.5.0
Environment: timezone for both client and server "Europe/Moscow"
(UTC+3)
hive version 3.1.0.3.0.1.0-187
Reporter: Rei Mai
Attachments: 000000_0
We're using external orc tables and a timezone "Europe/Moscow" (UTC+3) for both
client and server. After switching to hive 3 which uses orc 1.5.x we've got an
issue with predicate push down filtering out matching stripes by timestamp.
E.g. consider a table (it's orc data is in the attachment):
{quote}{{create external table test_ts (ts timestamp) stored as orc;}}
{{insert into test_ts values ("2018-12-24 18:30:00");}}
{{// No rows selected}}
{{select * from test_ts where ts < "2018-12-24 19:00:00";}}
// the lowest filter to return the value
{{select * from test_ts where ts <= "2018-12-24 21:30:00";}}
{quote}
The issue only affect external orc tables statistics. Turning ppd off with _set
hive.optimize.index.filter=false;_ helps.
We believe it was the https://jira.apache.org/jira/browse/ORC-341, which
introduced it.
org.apache.orc.impl.SerializationUtils utc convertion is rather strange:
{quote}public static long convertToUtc(TimeZone local, long time) {
int offset = local.getOffset(time);
return time + offset;
}
{quote}
This adds a 3 hour offset to our timestamp in UTC+3 timezone (shouldn't it
substract 3 hours, btw?).
If org.apache.orc.impl.TimestampStatisticsImpl is used with
useUTCTimestamp=false, the timestamp is converted back in a compatible way via
SerializationUtils.convertFromUtc. But hive seems to override default
org.apache.orc.OrcFile.ReaderOptions with
org.apache.hadoop.hive.ql.io.orc.ReaderOptions which have useUTCTimestamp(true)
in it's constructor. With useUTCTimestamp=true evaluatePredicateProto
predictate is using TimestampStatisticsImpl.getMaximumUTC(), which returns the
timestamp as is, i.e. in the example it's "{{2018-12-24 21:30:00 }}UTC+3".
At the same time the predicate is not shifted (the value in this tez log is in
UTC+3):
{quote}2018-12-24 22:12:16,205 [INFO] [InputInitializer \{Map 1} #0]
|orc.OrcInputFormat|: ORC pushdown predicate: leaf-0 = (LESS_THAN ts 2018-12-24
19:30:00.0), expr = leaf-0
{quote}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)