I feel like we might still be talking past each other here, or at least I
don't understand the two sides of this. I'll try to expand Weston's example
because I think it provides the best clarification.
(1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming ms)
for a timestamp column without timezone (always). This represents an offset
from the Unix epoch, and that interpretation should not change based on the
local system timezone. Extracting the hour field always yields 14 (extraction
is done relative to UTC). The alternative here seems to be that we can encode
(1970, 1, 2, 14, 0) in multiple different ways depending on what the current
local system timezone is. As a note, I think ORC and Spark do this, and it
leads to confusion/misinterpretation when trying to transfer data [1].

If we then convert this column to a timestamp with timezone in the "UTC"
timezone, extracting the hour field still yields 14. If the column is instead
converted to a timestamp with timezone in PST, extracting the hour would
yield 6 (assume PST = GMT-8). Through all of these changes the data bits do
not change.

Display is not mentioned because I think the points already made about how a
timestamp is displayed are correct: applications can choose what they feel
makes sense to them (as long as they don't start automatically tacking on
timezones to naive timestamps). My interpretation of the specification has
been that "display" was kind of shorthand for field extraction.

Could others on the thread confirm this is the issue up for debate? Are there
subtleties/operations we need to consider? I also agree that we should
document recommended conversion practices from other systems.
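To make the semantics above concrete, here is a minimal sketch in plain
Python (standard library only; it illustrates the interpretation described
above rather than any Arrow API, and the helper name extract_hour is invented
for the example):

    from datetime import datetime, timedelta, timezone

    EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

    # (1970, 1, 2, 14, 0) stored as milliseconds since the Unix epoch
    stored_ms = int((datetime(1970, 1, 2, 14, tzinfo=timezone.utc)
                     - EPOCH).total_seconds() * 1000)         # 136800000

    # "Field extraction" interprets the stored offset in a given zone;
    # the stored bits themselves never change.
    def extract_hour(ms_since_epoch, tz=timezone.utc):
        return datetime.fromtimestamp(ms_since_epoch / 1000, tz).hour

    extract_hour(stored_ms)                                   # 14 (no timezone, or tz="UTC")
    extract_hour(stored_ms, timezone(timedelta(hours=-8)))    # 6  (PST = GMT-8)

Converting the column from "without timezone" to "with timezone" only changes
the zone passed to extraction/display; stored_ms itself is untouched.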
-Micah

> So let's invent a third way. I could use
> the first 16 bits for the year, the next 8 bits for the month, the
> next 8 bits for the day of month, the next 8 bits for the hour, the
> next 8 bits for the minute, and the remaining bits for the seconds.
> Using this method I would store (1970, 1, 2, 14, 0) as
> 0x07B201020E000000.

As an aside, with some small variation this is what ZetaSql uses [2].

[1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
[2] https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62

On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:

> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > with/without time zone — in some SQL implementations each slot can
> > have a different time zone
> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> > or given as timezone local time"
>
> RDBMSs conflict (universally) with ANSI.
>
> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since the epoch.
> It has no timezone.
>
> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
> since the epoch. It has no timezone.
>
> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
> *function*, but not in implementation:
>
> - MySQL DATETIME
>   <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
>   is weird: 1-bit sign, 17-bit month, 5-bit day, ...
> - MSSQL
>   <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
>   uses 6, 7 or 8 bytes
> - PostgreSQL stores an integer, but I think its epoch is still different
>   <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
>   (it used to store doubles since 2000-01-01)
>
> ... so in general, moving datetimes from these systems into 64-bit integers
> is nontrivial and lossy.
>
> > Spark / Databricks discusses how Spark handles this
> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> > and are wall clock timestamps." — not UTC-normalized
> > * WITH TIME ZONE: "does not affect the physical point in time that the
> > timestamp represents, as that is fully represented by the UTC time
> > instant given by the other timestamp components"
>
> I don't use Spark, but I read that page twice. First reading, I got the
> same thing out of it. But the second time I read it, I read the opposite!
>
> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
> SESSION TIME ZONE*" -- in other words, Spark doesn't have a TIMESTAMP WITH
> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp type:
> a 64-bit Instant since the epoch. (It also has a Date type.)
>
> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
> timestamps as bare 64-bit integers since the epoch -- without timezone.
>
> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > interpreted as UTC-normalized, that would force all of these other
> > systems (and more) to serialize their data to be UTC-normalized
>
> Are those systems' 64-bit integers interoperable in the first place?
>
> As I understand it, there's a ton of variance out there when encoding
> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding
> is one of many. As I mentioned in another thread, programming languages all
> use structs.
>
> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> > timestamps and UTC-normalized WITH TIME ZONE.
> >
> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > interpreted as UTC-normalized, that would force all of these other
> > systems (and more) to serialize their data to be UTC-normalized (i.e.
> > calling the equivalent of pandas's tz_localize function) when they
> > convert to Arrow.
>
> Alternatives:
>
> - int64
> - date32+time64
> - date32+time32
>
> > This seems very harmful to me, and will make data
> > from these systems not accurately representable in Arrow and unable to
> > be round-tripped.
>
> Certainly nobody wants to go backwards.
>
> We need to clarify: how do we store these *common* types -- MySQL
> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> Arrow?
>
> Secondarily, I think: how do we recommend users store *datetimes* in Arrow?
> (I'd expect this to be messier, since every system/language uses a
> different byte structure.)
>
> > Perhaps we can make a spreadsheet and look comprehensively at how many
> > use cases would be disenfranchised by requiring UTC normalization
> > always.
>
> Hear, hear!
>
> Can we also poll people to find out how they're storing Instants today?
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com