On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
> The SQL standard (e.g. PostgreSQL) has two timestamp types:
> with/without time zone — in some SQL implementations each slot can
> have a different time zone
> https://www.postgresql.org/docs/9.1/datatype-datetime.html
> WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> or given as timezone local time"

RDBMSs conflict (universally) with ANSI. PostgreSQL TIMESTAMP WITH TIME
ZONE is a 64-bit int Instant since the epoch. It has no timezone.
MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
Instant since the epoch. It has no timezone.

TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to NumPy "naive datetime"
in *function*, but not in implementation:

- MySQL DATETIME
  <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
  is weird: 1-bit sign, 17-bit year*13+month, 5-bit day, ...
- MSSQL
  <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
  uses 6, 7 or 8 bytes, depending on precision
- PostgreSQL stores an integer, but its epoch is different -- 2000-01-01
  <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
  (it used to store doubles)

... so in general, moving datetimes from these systems into 64-bit
integers is nontrivial and lossy.

> Spark / Databricks discusses how Spark handles this
>
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>
> * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> and are wall clock timestamps." — not UTC-normalized
> * WITH TIME ZONE: "does not affect the physical point in time that the
> timestamp represents, as that is fully represented by the UTC time
> instant given by the other timestamp components"

I don't use Spark, but I read that page twice. On the first reading I
got the same thing out of it as you did; on the second, I read the
opposite! The key part is: "*Spark SQL defines the timestamp type as
TIMESTAMP WITH SESSION TIME ZONE*" -- in other words, Spark has neither
TIMESTAMP WITH TIME ZONE *nor* TIMESTAMP WITHOUT TIME ZONE. It has just
one Timestamp type: a 64-bit Instant since the epoch. (It also has a
Date type.)

If I'm reading correctly, this is exactly the same as PostgreSQL
TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: those servers always
store and transmit timestamps as bare 64-bit integers since the epoch
-- without timezone.

> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized

Are those systems' 64-bit integers interoperable in the first place? As
I understand it, there's a ton of variance out there when encoding
datetimes as 64-bit integers. The NumPy+Arrow+Parquet+Postgres encoding
is one of many. And as I mentioned in another thread, programming
languages don't even use integers for naive datetimes: they all use
structs.

> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> timestamps and UTC-normalized WITH TIME ZONE.
>
> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized (i.e.
> calling the equivalent of pandas's tz_localize function) when they
> convert to Arrow.
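Concretely, that tz_localize step doesn't convert information -- it
invents it, by *assuming* a zone. A quick pandas sketch (made-up
values, but the API is real):

    import pandas as pd

    # A "naive" (WITHOUT TIME ZONE) timestamp: no instant is implied.
    naive = pd.Timestamp("2021-06-17 14:59:00")

    # tz_localize() picks an instant by assuming a zone. Two different
    # assumptions give two different 64-bit values on the wire:
    as_utc = naive.tz_localize("UTC")
    as_ny = naive.tz_localize("America/New_York")

    print(as_utc.value)  # nanoseconds since the epoch
    print(as_ny.value)   # a different integer, same wall-clock time

Two assumptions, two different integers: that's the round-trip hazard.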
Alternatives:

- int64
- date32+time64 (sketched in the P.S. below)
- date32+time32

> This seems very harmful to me, and will make data from these systems
> not accurately representable in Arrow and unable to be round-tripped.

Certainly nobody wants to go backwards. We need to clarify: how do we
store these *common* types -- MySQL TIMESTAMP, Spark TIMESTAMP,
PostgreSQL TIMESTAMP WITH TIME ZONE -- in Arrow?

Secondarily, I think: how do we recommend users store *datetimes* in
Arrow? (I'd expect this to be messier, since every system/language uses
a different byte structure.)

> Perhaps we can make a spreadsheet and look comprehensively at how many
> use cases would be disenfranchised by requiring UTC normalization
> always.

Hear, hear! Can we also poll people to find out how they're storing
Instants today?

Enjoy life,
Adam

--
Adam Hooper
+1-514-882-9694
http://adamhooper.com
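P.S. To make the date32+time64 alternative concrete, here's a sketch in
pyarrow (the column names are mine; an illustration, not a spec
proposal):

    import datetime
    import pyarrow as pa

    # One naive wall-clock datetime, 2021-06-17 14:59:00, stored without
    # inventing an instant: the date and the time-of-day travel as
    # separate columns.
    table = pa.table({
        "event_date": pa.array([datetime.date(2021, 6, 17)]),              # date32[day]
        "event_time": pa.array([datetime.time(14, 59)], pa.time64("us")),  # time64[us]
    })
    print(table.schema)

It round-trips exactly -- at the cost of two columns instead of one.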