Philip Zeyliger created IMPALA-7730:
---------------------------------------
Summary: Improve ORC File Format Timezone issues
Key: IMPALA-7730
URL: https://issues.apache.org/jira/browse/IMPALA-7730
Project: IMPALA
Issue Type: Task
Components: Backend
Affects Versions: Impala 3.0
Reporter: Philip Zeyliger
As pointed out in https://gerrit.cloudera.org/#/c/11731 by [~csringhofer], our
support for the ORC file format doesn't follow the same timezone conventions as
the rest of Impala.
{quote}
tldr: ORC's timezone handling is likely to be broken in Impala, so we should
patch it in the toolchain.
The ORC library implements its own IANA timezone handling to convert stored
timestamps from UTC to local time, and does something similar for min/max
stats. The writer's timezone can also be stored in .orc files and used instead
of the local timezone.
Impala's and the ORC library's timezones can differ for several reasons:
* ORC's timezone is not overridden by the env var TZ or the query option
  timezone.
* ORC uses a simpler way to detect the local timezone, which may not work on
  some Linux distros (see TimezoneDatabase::LocalZoneName in Impala vs
  LOCAL_TIMEZONE in Orc).
* .orc files can use any time zone as the writer's timezone, and we cannot be
  sure that it will exist on the reader machine.
My suggestion is to patch the ORC library in the toolchain and remove timezone
handling (e.g. by always using UTC, maybe depending on a flag), as the way it
is currently working is likely to be broken and is surely not consistent with
the rest of Impala.
I am not sure how timezones could be handled correctly in Orc + Impala. If
someone plans to work on it, I would gladly help with the integration into
Impala.
{quote}
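To make the mismatch in the quoted comment concrete, here is a minimal sketch of
how the same stored UTC value can yield two different wall-clock timestamps when
the reader library and the query engine disagree on the timezone, and how an
"always use UTC" flag removes the discrepancy. Everything in it is an
assumption made for illustration: the two fixed-offset zones, the
read_timestamp helper, and its always_utc parameter are hypothetical and are
not part of the ORC library or Impala.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed-offset zones standing in for real IANA zones, so the
# sketch does not depend on a timezone database being installed.
SESSION_TZ = timezone(timedelta(hours=1))    # what the engine believes (e.g. from TZ)
DETECTED_TZ = timezone(timedelta(hours=-8))  # what the reader library detected


def read_timestamp(epoch_seconds, tz, always_utc=False):
    """Convert a stored UTC epoch value to a naive wall-clock datetime.

    With always_utc=True the timezone argument is ignored, which models the
    suggestion of disabling timezone handling in the reader library.
    """
    effective_tz = timezone.utc if always_utc else tz
    return datetime.fromtimestamp(epoch_seconds, effective_tz).replace(tzinfo=None)


# A value stored in the file as UTC: 2018-10-15 12:00:00 UTC.
stored = int(datetime(2018, 10, 15, 12, 0, tzinfo=timezone.utc).timestamp())

engine_view = read_timestamp(stored, SESSION_TZ)
library_view = read_timestamp(stored, DETECTED_TZ)
print("engine view: ", engine_view)
print("library view:", library_view)
print("difference:  ", engine_view - library_view)

# With the flag set, both sides agree regardless of the configured zone.
utc_engine = read_timestamp(stored, SESSION_TZ, always_utc=True)
utc_library = read_timestamp(stored, DETECTED_TZ, always_utc=True)
print("with always_utc:", utc_engine, utc_library)
```

The sketch only models the conversion step; it does not cover min/max stats or
a writer timezone that is missing on the reader machine.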