I think comments on the doc are tailing off. Jorge's test cases, I think, still need some more careful analysis, but Weston has provided an initial pass.
The matter not yet resolved in the document is whether Timestamp without timezone logically represents a multi-field date and time (one that does not denote a specific instant) or whether it logically represents an instant (some measurable offset from an epoch). Based on comments on the document, both the C++/Python implementations and the Java implementation (those that have generally been considered "reference") provide evidence that the former representation is what is intended (some links are in the document). We can probably continue to debate what is useful, but it seems ultimately we need to pick one or the other and clarify the specification. Given how the reference implementations currently work, I think we should err on the side of interpreting these values as date times. Ultimately, given the contention here, we will likely need to vote on this. More comments on the document or here are still useful in case we've missed an interpretation or there are other facts to consider.

Cheers,
Micah

On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> Thank you everyone for participating so far; really important and
> useful discussion.
>
> I think of this discussion as a set of test cases over behavior:
>
> Parameterization:
> * Timestamp(ms, None)
> * Timestamp(ms, "00:00")
> * Timestamp(ms, "01:00")
>
> Cases:
> * its string representation equals to
> * add a duration equals to
> * add an interval equals to
> * subtract a Timestamp(ms, None) equals to
> * subtract a Timestamp(ms, "01:00") equals to
> * subtract a Date32 equals to
> * subtract a Time32(ms) equals to
> * extract the day equals to
> * extract the timezone equals to
> * cast to Timestamp(ms, None) equals to
> * cast to Timestamp(ms, "01:00") equals to
> * write to parquet v2 equals to (physical value and logical type)
>
> In all cases, the result may either be valid or invalid. If valid, we
> would need a datatype and an actual value.
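As a concrete illustration of the two interpretations under debate, here is a standard-library-only sketch (not tied to any Arrow implementation): the same stored 64-bit value can be read either as a naive wall-clock date/time or as an instant that a fixed offset merely re-displays.

```python
from datetime import datetime, timedelta, timezone

# One stored value for Timestamp(ms, ...): milliseconds since 1970-01-01.
value_ms = 136_800_000  # 38 hours, i.e. the encoding of (1970, 1, 2, 14, 0)

# Reading 1: a multi-field date/time. Fields come straight from the value,
# with no timezone involved (the Timestamp(ms, None) case).
naive = datetime(1970, 1, 1) + timedelta(milliseconds=value_ms)
assert (naive.year, naive.month, naive.day, naive.hour) == (1970, 1, 2, 14)

# Reading 2: an instant. The same bits denote a point on the UTC timeline,
# which a zone such as "01:00" only re-displays (the Timestamp(ms, "01:00") case).
instant = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=value_ms)
in_plus_one = instant.astimezone(timezone(timedelta(hours=1)))
assert in_plus_one.hour == 15  # same instant, shifted wall clock
```

Under the first reading the bits are zone-independent fields; under the second, the zone changes only how the one fixed instant is rendered.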
> I was hoping to be able to answer each of the above at the end of this
> discussion.
>
> I've suggested adding these in the Google doc.
>
> Best,
> Jorge
>
> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > I've posted the examples above in
> > https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > because I think it would be better to collaborate there instead of in a
> > linear e-mail history, and then bring the consensus back to the list.
> >
> > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > I feel like we might still be talking past each other here, or at least I
> > > don't understand the two sides of this. I'll try to expand Weston's
> > > example because I think it provides the best clarification.
> > >
> > > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming
> > > ms) for a timestamp column without timezone (always). This represents an
> > > offset from the unix epoch. This interpretation should not change based on
> > > the local system timezone. Extracting the hour field always yields 14
> > > (extraction is done relative to UTC).
> > >
> > > The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in
> > > multiple different ways depending on what the current local system time
> > > is. As a note, I think ORC and Spark do this, and it leads to
> > > confusion/misinterpretation when trying to transfer data.
> > >
> > > If we then convert this column to a timestamp with a timezone in the "UTC"
> > > timezone, extracting the hour field still yields 14. If the column is
> > > converted to Timestamp with timezone PST, extracting the hour yields 6
> > > (assume PST = GMT-8). Through all of these changes the data bits do not
> > > change.
> > >
> > > Display is not mentioned because I think the points already made about how
> > > a timestamp is displayed are correct.
> > > Applications can choose what they feel makes sense to
> > > them (as long as they don't start automatically tacking on timezones to
> > > naive timestamps). My interpretation of the specification has been that
> > > display was a kind of shorthand for field extraction.
> > >
> > > Could others on the thread confirm this is the issue up for debate? Are
> > > there subtleties/operations we need to consider?
> > >
> > > I also agree that we should document recommended conversion practices from
> > > other systems.
> > >
> > > -Micah
> > >
> > >> So let's invent a third way. I could use
> > >> the first 16 bits for the year, the next 8 bits for the month, the
> > >> next 8 bits for the day of month, the next 8 bits for the hour, the
> > >> next 8 bits for the minute, and the remaining bits for the seconds.
> > >> Using this method I would store (1970, 1, 2, 14, 0) as
> > >> 0x07B201020E000000.
> > >
> > > As an aside, with some small variation this is what ZetaSql uses [2].
> > >
> > > [1]
> > > https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > > [2]
> > > https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> > >
> > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
> > >
> >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> >> > with/without time zone — in some SQL implementations each slot can
> >> > have a different time zone
> >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> >> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> >> > or given as timezone local time"
> >>
> >> RDBMSs conflict (universally) with ANSI.
> >>
> >> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since the epoch.
> >> It has no timezone.
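The "third way" packing quoted above is easy to check mechanically. This is a stdlib-only sketch; the field widths follow the description in the quote, and `pack_civil` is a hypothetical helper, not an API from any system discussed in the thread:

```python
from datetime import datetime, timezone

def pack_civil(year, month, day, hour, minute, seconds=0):
    """Bit-pack a civil date/time: 16-bit year, then 8 bits each for
    month, day-of-month, hour, and minute, with the remaining 16 bits
    holding the seconds."""
    return (year << 48) | (month << 40) | (day << 32) | \
           (hour << 24) | (minute << 16) | seconds

# The example value from the quote round-trips.
assert pack_civil(1970, 1, 2, 14, 0) == 0x07B201020E000000

# The same civil tuple under the epoch-offset encoding is a very
# different number: 1 day + 14 hours = 136,800,000 ms since the epoch.
dt = datetime(1970, 1, 2, 14, 0, tzinfo=timezone.utc)
assert int(dt.timestamp() * 1000) == 136_800_000
```

The point of the contrast: both integers "mean" (1970, 1, 2, 14, 0), which is exactly why the specification has to say which encoding the bits use.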
> >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
> >> since the epoch. It has no timezone.
> >>
> >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
> >> *function*, but not in implementation:
> >>
> >> - MySQL DATETIME
> >> <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
> >> is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> >> - MSSQL
> >> <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
> >> uses 6, 7 or 8 bytes
> >> - PostgreSQL stores an integer, but I think its epoch is still different
> >> <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
> >> (it used to store doubles since 2000-01-01)
> >>
> >> ... so in general, moving datetimes from these systems into 64-bit
> >> integers is nontrivial and lossy.
> >>
> >> > Spark / Databricks discusses how Spark handles this
> >> >
> >> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> >> > and are wall clock timestamps." — not UTC-normalized
> >> > * WITH TIME ZONE: "does not affect the physical point in time that the
> >> > timestamp represents, as that is fully represented by the UTC time
> >> > instant given by the other timestamp components"
> >>
> >> I don't use Spark, but I read that page twice. First reading, I got the
> >> same thing out of it. But the second time I read it, I read the opposite!
> >> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
> >> SESSION TIME ZONE*" -- in other words, Spark doesn't have a TIMESTAMP WITH
> >> TIME ZONE *or* a TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp
> >> type: a 64-bit Instant since the epoch. (It also has a Date type.)
> >>
> >> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
> >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
> >> timestamps as bare 64-bit integers since the epoch -- without timezone.
> >>
> >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >> > interpreted as UTC-normalized, that would force all of these other
> >> > systems (and more) to serialize their data to be UTC-normalized
> >>
> >> Are those systems' 64-bit integers interoperable in the first place?
> >>
> >> As I understand it, there's a ton of variance out there when encoding
> >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding
> >> is one of many. As I mentioned in another thread, programming languages
> >> all use structs.
> >>
> >> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> >> > timestamps and UTC-normalized WITH TIME ZONE.
> >> >
> >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >> > interpreted as UTC-normalized, that would force all of these other
> >> > systems (and more) to serialize their data to be UTC-normalized (i.e.
> >> > calling the equivalent of pandas's tz_localize function) when they
> >> > convert to Arrow.
> >>
> >> Alternatives:
> >>
> >> - int64
> >> - date32+time64
> >> - date32+time32
> >>
> >> > This seems very harmful to me, and will make data
> >> > from these systems not accurately representable in Arrow and unable to
> >> > be round-tripped.
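The "localize" step mentioned in the quote (attaching a zone to a naive timestamp, as pandas's tz_localize does) is a different operation from converting an instant between zones, and the distinction is the crux of the round-tripping concern. A stdlib-only sketch, using a fixed -08:00 offset as a stand-in for PST:

```python
from datetime import datetime, timedelta, timezone

naive = datetime(1970, 1, 2, 14, 0)     # wall-clock reading, no zone
pst = timezone(timedelta(hours=-8))     # fixed-offset stand-in for PST

# Localize (tz_localize analogue): keep the wall clock, attach the zone.
# The displayed fields stay the same, but the underlying instant moves.
localized = naive.replace(tzinfo=pst)
assert localized.hour == 14
assert localized.astimezone(timezone.utc).hour == 22

# Convert: treat the stored value as a UTC instant and re-display in PST.
# The instant stays the same, but the wall clock shifts.
converted = naive.replace(tzinfo=timezone.utc).astimezone(pst)
assert converted.hour == 6
```

Requiring UTC normalization on import amounts to forcing the first operation on every naive value, which is exactly the serialization cost the quoted text is warning about.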
> >> Certainly nobody wants to go backwards.
> >>
> >> We need to clarify: how do we store these *common* types -- MySQL
> >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> >> Arrow?
> >>
> >> Secondarily, I think: how do we recommend users store *datetimes* in
> >> Arrow? (I'd expect this to be messier, since every system/language uses a
> >> different byte structure.)
> >>
> >> > Perhaps we can make a spreadsheet and look comprehensively at how many
> >> > use cases would be disenfranchised by requiring UTC normalization
> >> > always.
> >>
> >> Hear, hear!
> >>
> >> Can we also poll people to find out how they're storing Instants today?
> >>
> >> Enjoy life,
> >> Adam
> >>
> >> --
> >> Adam Hooper
> >> +1-514-882-9694
> >> http://adamhooper.com