Github user zivanfi commented on the issue:
https://github.com/apache/spark/pull/19250
Hive and Impala introduced the following workaround for timestamp
interoperability long ago: The footer of the Parquet file contains metadata
about the library that wrote the file. For Hive and Spark this value is
parquet-mr; for Impala it is impala itself, since it has its own
implementation. Since Hive and Spark write using UTC-normalized semantics and
Impala writes using timezone-agnostic semantics, we can deduce the semantics
used from the writer info. So, when Hive sees a Parquet file written by
Impala, it adjusts the timestamps to compensate for the difference. Impala has
an option
([-convert_legacy_hive_parquet_utc_timestamp](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_timestamp.html))
to do the same when it sees a Parquet file written by anything other than
Impala.
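
The writer-dependent adjustment described above can be sketched roughly as follows. This is only an illustration, not the actual Hive or Impala code: the function names are hypothetical, the writer string is assumed to start with the library name, and a fixed UTC offset stands in for a real local timezone.

```python
from datetime import datetime, timedelta, timezone


def writer_semantics(created_by: str) -> str:
    """Guess the timestamp semantics from the footer's writer info.

    Assumption: the writer string starts with the library name
    ("impala" or "parquet-mr"); anything not from Impala is treated
    as UTC-normalized.
    """
    if created_by.strip().lower().startswith("impala"):
        return "timezone_agnostic"
    return "utc_normalized"


def displayed_wall_clock(stored: datetime, created_by: str,
                         local_offset_hours: int) -> datetime:
    """Return the wall-clock value a reader should show for a stored value.

    `stored` is the naive value as physically present in the file.
    `local_offset_hours` is a simplified stand-in for the reader's timezone.
    """
    if writer_semantics(created_by) == "timezone_agnostic":
        # Impala stored the wall clock itself, so no conversion is applied.
        return stored
    # Hive/Spark stored a UTC instant, so convert it to the reader's zone.
    local_zone = timezone(timedelta(hours=local_offset_hours))
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(local_zone).replace(tzinfo=None)
```

With a reader at UTC-7, a stored value of 12:00 from a parquet-mr writer is shown as 05:00, while the same stored value from an Impala writer is shown unchanged.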
There are two problems with this workaround:
- Spark has no similar logic, so while Impala<->Hive and
Hive<->SparkSQL can read each other's timestamps, the Impala->SparkSQL
direction does not work. If SparkSQL implemented such a writer-dependent
adjustment logic, that alone would improve interoperability significantly.
- The adjustment depends on the local timezone. This can be problematic if
a table contains timestamps of mixed semantics, i.e. some data was written by
Impala and some data was written by Hive or SparkSQL. In that case, if the
local timezone used for reading differs from the one used for writing, some
of the timestamps will change while others won't. Both of these behaviors
are okay by themselves, since these are two valid timestamp semantics, but both
of them happening on a single table is very counter-intuitive. This would be
especially problematic for SparkSQL, since SparkSQL has a session-local
timezone setting (Impala and Hive use the server timezone, which tends to
remain unchanged).
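
To make the second problem concrete, here is a small sketch of a mixed-semantics table read under two different session timezones. The table contents, the `read_row`/`read_table` helpers and the fixed UTC offsets are all illustrative assumptions, not anything taken from Spark, Hive or Impala.

```python
from datetime import datetime, timedelta, timezone


def read_row(stored, writer, session_offset_hours):
    """Writer-dependent read: adjust parquet-mr rows, pass Impala rows through."""
    if writer == "impala":
        return stored
    session_zone = timezone(timedelta(hours=session_offset_hours))
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(session_zone).replace(tzinfo=None)


# Both rows were meant as 2017-09-01 12:00 and were written at UTC-7.
mixed_table = [
    (datetime(2017, 9, 1, 12, 0), "impala"),      # stored as the wall clock
    (datetime(2017, 9, 1, 19, 0), "parquet-mr"),  # stored as a UTC instant
]


def read_table(session_offset_hours):
    """Read every row of the mixed table under one session timezone."""
    return [read_row(s, w, session_offset_hours) for s, w in mixed_table]
```

Under these assumptions, `read_table(-7)` shows 12:00 for both rows, while `read_table(1)` still shows 12:00 for the Impala row but 20:00 for the parquet-mr row, which is the "some change, some don't" behaviour described above.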
To address both of these issues, we added the recognition and
adjustment of Impala-written timestamps to SparkSQL, and we also added a table
property to record the timezone that should be used for these adjustments, so
that mixed tables no longer lead to unintuitive behaviour. We also added
this table property to the Impala and Hive logic and tested that they can
correctly read each other's timestamps.
However, our initial commit (the first of two, meant to be followed by a
second change) was reverted in Spark due to some
concerns. Because the table property only provides interoperability if
respected by all affected components, we reverted our changes to Hive and
Impala as well until we can reach an agreement with Spark.
To address the concerns that led Reynold to revert our initial commit,
Imran made three changes compared to our original proposal:
- The adjustment logic was moved to the analyzer.
- The writer-specific logic was removed, so all timestamps get the same
treatment regardless of the component that wrote them. As a result, the
code became simpler and nicer at the price of a behaviour change: Since
Impala-written timestamps are already timezone-agnostic, the user now has to
specify UTC in the table property for that table (earlier it didn't matter). It
also means that it is no longer possible to fix a table that already has
mixed-semantics content, since you cannot set the table property to UTC, as
that would make the Hive/Spark timestamps wrong, and you cannot set it to the
local timezone either, because that would make the Impala timestamps wrong.
Although more restrictive than our initial proposal, this still seems
acceptable, since most existing tables are single-writer only, and once you set
the table property all writers will respect it, so after you set the table
property you can have mixed-writer tables (they won't become mixed-semantics,
thanks to the table property).
- The adjustment logic was made file-format-agnostic so that it applies not
only to Parquet but to any kind of table. However, we then realized
that this will lead to further interoperability problems, so we would like
to stick with a Parquet-specific approach, as we originally proposed, to avoid
making the situation worse.
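
The consequence of dropping the writer-specific logic (second change above) can be illustrated with a simplified model. It assumes the reader converts every stored value from UTC to the zone named in the table property, regardless of writer; the helper names and the fixed UTC offsets are hypothetical and only stand in for the real logic.

```python
from datetime import datetime, timedelta, timezone


def read_with_table_zone(stored, table_offset_hours):
    """Writer-agnostic read: convert the stored value from UTC to the table zone."""
    zone = timezone(timedelta(hours=table_offset_hours))
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(zone).replace(tzinfo=None)


# The wall-clock value the user meant, written by both engines at UTC-7.
INTENDED = datetime(2017, 9, 1, 12, 0)
WRITER_OFFSET_HOURS = -7

impala_stored = INTENDED                                       # wall clock as-is
hive_stored = INTENDED - timedelta(hours=WRITER_OFFSET_HOURS)  # 19:00 as UTC


def correct_rows(table_offset_hours):
    """Report which writer's rows read back correctly for a given table zone."""
    return {
        "impala": read_with_table_zone(impala_stored, table_offset_hours) == INTENDED,
        "hive": read_with_table_zone(hive_stored, table_offset_hours) == INTENDED,
    }
```

In this model, a table property of UTC (`correct_rows(0)`) reads the Impala row correctly but not the Hive row, and a table property of the writer's local zone (`correct_rows(-7)`) does the opposite, so no single value repairs a table that already has mixed-semantics content.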