Hi folks!

I am working on the Parquet writer for the new timestamp formats
(IMPALA-5051), and I have a dilemma about how to reduce a timestamp's
precision from nanoseconds to milliseconds or microseconds. I have to
choose between consistency with Hive and consistency with Impala itself:

- Impala currently rounds timestamps to microseconds when writing Kudu
tables (with some extra hacking near year 10000 to avoid rounding to an
invalid timestamp). This was implemented in IMPALA-5137.

- Hive seems to truncate timestamps towards negative infinity when it has
to reduce precision.

I lean towards truncating - in theory rounding introduces a smaller
error, but it can move the timestamp to a different day / DST rule / year,
which can cause much bigger differences in some queries. Truncating towards
negative infinity also seems simpler and faster, as it only needs an
integer division on the time_ part of Impala's TimestampValue and doesn't
need special handling for values near the edge of the valid range, like
"9999-12-31 23:59:59.999999999".
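
To make the difference concrete, here is a minimal sketch in Python (not
Impala code). It assumes a timestamp is modelled as a hypothetical pair
(day number, nanoseconds since midnight), loosely mirroring the date and
time_ parts of TimestampValue; the function names are made up for
illustration.

```python
NANOS_PER_MICRO = 1_000
NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000


def truncate_to_micros(day, nanos_of_day):
    """Drop the sub-microsecond digits. The day part is never touched."""
    # nanos_of_day is assumed to be non-negative, so integer division
    # here is the same as truncating towards negative infinity.
    return day, (nanos_of_day // NANOS_PER_MICRO) * NANOS_PER_MICRO


def round_to_micros(day, nanos_of_day):
    """Round half up to the nearest microsecond. May carry into the next day."""
    rounded = (
        (nanos_of_day + NANOS_PER_MICRO // 2) // NANOS_PER_MICRO
    ) * NANOS_PER_MICRO
    if rounded == NANOS_PER_DAY:
        # The carry changes the date; if `day` were the last valid day
        # (9999-12-31), the result would be out of range and would need
        # special handling.
        return day + 1, 0
    return day, rounded


# 23:59:59.999999999 on some day d (d = 0 here, an arbitrary choice)
last_nano = NANOS_PER_DAY - 1
truncated = truncate_to_micros(0, last_nano)  # stays on day 0
rounded = round_to_micros(0, last_nano)       # moves to day 1, midnight
```

With truncation the result stays at 23:59:59.999999 on the same day, while
rounding carries over to midnight of the following day, which is exactly
the case that needs extra care near the maximum timestamp.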

My proposal is to go with truncation in the Parquet writer, and consider
switching the Kudu writer too, maybe in the next major release.

Csaba
