Re: [Format][Important] Needed clarification of timezone-less timestamps

Weston Pace Mon, 14 Jun 2021 22:34:47 -0700

In retrospect, I should probably go for a long run before I write an
email and not after.  I had some time to mull it over and I realize
now that I was wrong.  My vote now changes to +1 for Antoine.  Can
someone please verify my understanding of the two formats below?


---
Let's pretend two astronomers observe a meteoroid impact on the moon.
We are talking about two different ways they can record the time.  The
first method, universal time, is done by recording the seconds since
the epoch.  The second, wall clock time, is done by writing down the
time seen on a clock (and nearby calendar).

In both cases we do not know the full picture without the time zone
information.  If we have two universal times (but no time zones) we
can say whether the two astronomers witnessed the same event (assuming
the impact site is the same) but we can't say whether they saw it at the
same time of day (e.g. whether the two astronomers had both just
finished dinner).

If we have two wall clock times (but no time zones) we can say whether
the two astronomers witnessed the impact at the same time of day but
we can't say if they witnessed the same event.
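
A rough stdlib-only sketch of the two paragraphs above, in case it helps.
The fixed-offset zones and the instants are made up for illustration (and
DST is ignored); nothing here uses Arrow.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed-offset zones for the two astronomers (DST ignored).
tz_a = timezone(timedelta(hours=-7))
tz_b = timezone(timedelta(hours=2))

# One impact, i.e. one absolute instant.
impact = datetime(2021, 6, 15, 3, 0, tzinfo=timezone.utc)

# Universal time: both astronomers record the same seconds since the epoch...
same_event = (impact.astimezone(tz_a).timestamp()
              == impact.astimezone(tz_b).timestamp())

# ...but their wall clocks (naive datetimes) show different times of day.
wall_a = impact.astimezone(tz_a).replace(tzinfo=None)
wall_b = impact.astimezone(tz_b).replace(tzinfo=None)
same_time_of_day = wall_a.time() == wall_b.time()

# Conversely, a *different* impact that B sees at 20:00 local time gives
# the same wall clock reading as A's, even though the instants differ.
other_impact = datetime(2021, 6, 14, 20, 0, tzinfo=tz_b)
wall_b_other = other_impact.replace(tzinfo=None)
same_wall_clock = wall_a == wall_b_other
different_instants = other_impact != impact
```

With only the universal times we can establish same_event; with only the
wall clock readings we can establish same_wall_clock, and neither tells
us the other.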

Rather than store wall clock time as a string (which is inefficient),
Arrow stores wall clock time as the epoch timestamp at the point a
wall clock in the UTC time zone would display the given time.  In
other words, converting datetime.datetime.now() to an Arrow timestamp
does NOT give the current UNIX epoch time.  The value that is stored is
different for every time zone.  Or, to put it yet another way, the
output of the following program...

import pyarrow as pa
import datetime
pa.array([datetime.datetime.strptime('Jun 28 2018 7:40AM',
         '%b %d %Y %I:%M%p')]).cast(pa.int64()).to_pylist()[0]

...will be identical on every machine.  But the output of...

import pyarrow as pa
import datetime
pa.array([datetime.datetime.now()]).cast(pa.int64()).to_pylist()[0]

...will depend on the system time zone (ostensibly because the output
of datetime.datetime.now() depends on the system time zone).
---
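
For anyone checking my understanding without pyarrow at hand, here is a
stdlib-only model of the storage rule described above.  The helper name
wall_clock_to_stored_seconds is hypothetical (it is not an Arrow API), and
it works in seconds rather than whatever unit pyarrow infers.

```python
from datetime import datetime, timezone

def wall_clock_to_stored_seconds(naive):
    """Epoch seconds at which a wall clock in UTC would display `naive`.

    Hypothetical helper modelling the rule described above; not an Arrow API.
    """
    if naive.tzinfo is not None:
        raise ValueError("expected a naive (timezone-less) datetime")
    # Pretend the reading came from a UTC clock, then take the epoch offset.
    return int(naive.replace(tzinfo=timezone.utc).timestamp())

# A fixed wall clock reading gives the same value on every machine,
# regardless of the system time zone.
stored = wall_clock_to_stored_seconds(datetime(2018, 6, 28, 7, 40))

# A reading taken from datetime.now() depends on the system time zone, so
# this value differs from the true UNIX time by the local UTC offset.
stored_now = wall_clock_to_stored_seconds(datetime.now())
```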

So, in my previous concrete example, I said...

> For each observation they record the unix timestamp (or maybe
> they build up an instance of datetime objects created with
> datetime.datetime.now())

These two methods would actually yield different results.  If they
created a pa.array([ts1, ts2], type=pa.timestamp('s')) with unix
timestamps recorded at the time of the event then they would get the
wrong histogram.

If they created a pa.array([dt1, dt2], type=pa.timestamp('s')) with
datetime.datetime objects created with datetime.datetime.now at the
time of the event then they would get the correct histogram.
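
A small sketch of why the two recording methods give different weekdays.
It assumes field extraction on a timezone-less value treats the stored
number as UTC, and it uses a made-up observer at a fixed UTC-07:00
offset; the function names are mine, not Arrow's.

```python
from datetime import datetime, timedelta, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical observer at a fixed UTC-07:00 offset (DST ignored).
observer_tz = timezone(timedelta(hours=-7))

# The event: 03:00 UTC on Tuesday, which is 20:00 on Monday for the observer.
event = datetime(2021, 6, 15, 3, 0, tzinfo=timezone.utc)

# Method 1: record the unix timestamp.
unix_seconds = int(event.timestamp())

# Method 2: record the wall clock (what datetime.now() would have returned),
# stored as the epoch value at which a UTC clock shows that reading.
wall_clock = event.astimezone(observer_tz).replace(tzinfo=None)
wall_seconds = int(wall_clock.replace(tzinfo=timezone.utc).timestamp())

def weekday_assuming_utc(seconds):
    """Extract the weekday from a stored value, treating it as UTC."""
    return WEEKDAYS[datetime.fromtimestamp(seconds, tz=timezone.utc).weekday()]

weekday_from_unix = weekday_assuming_utc(unix_seconds)   # the UTC weekday
weekday_from_wall = weekday_assuming_utc(wall_seconds)   # the observer's weekday
```

Only the second value matches the day the observer experienced, which is
what the histogram is meant to show.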

On Mon, Jun 14, 2021 at 5:34 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> I'm in no rush, so feel free to respond when you have time.
>
> > If the timezone field doesn't say how to display data to the user, and we
> > agree it doesn't describe how data is stored (since its very presence means
> > data is stored as UTC) ... well ... what *is* the meaning of the timezone
> > field?
>
> In retrospect, my comment (Arrow probably isn't used for the final
> formatting to the user) isn't relevant to the discussion and I agree
> with your original point.  Even if we aren't supporting "display" I
> do think this exact sort of "formatting of time zone" task is
> useful for analysis.  So it does seem like something Arrow will need
> to consider.  For example, creating a histogram by day-of-week
> requires formatting a timestamp into a weekday which requires a time
> zone.  This is what the timezone field is used for.
>
> >  TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> > field is not set. We have stated that we want systems to use
> > system-locale-independent choices for functions that act on this data
> > (like stringification or field extraction)
>
> This is indeed a rehash of an earlier discussion where I agreed with
> you but I think I understand the subtleties a bit more and now I
> disagree, particularly on field extraction.  Field extraction can be
> done on a naive "datetime" without assuming UTC which I think makes it
> safer for Python.  Field extraction cannot be done on a naive
> "timestamp" without assuming UTC.
>
> # Stringification
>
> I think we can get away with stringification.  It seems like the
> consensus is to always output UTC format.  I will point out that
> pyarrow does not do that today.  Currently in pyarrow I get
>
> >>> pa.array([datetime.datetime.now()])
> <pyarrow.lib.TimestampArray object at 0x7f8ae865d520>
> [
>   2021-06-14 17:30:52.260044  # Local time
> ]
>
> # Field extraction
>
> Here is a concrete example demonstrating the problems of field
> extraction.  Consider a user that runs an experiment over several
> weeks.  For each observation they record the unix timestamp (or maybe
> they build up an instance of datetime objects created with
> datetime.datetime.now()).  Then, using Arrow as a backend for
> analysis, they create a histogram to show events by weekday.  If
> Arrow is assuming UTC then the histogram is going to have the wrong
> days of the week (unless the user happens to be in UTC).
>
> Simple queries like "Give me all events that happened on Tuesday" or
> "Group rows by year" will not necessarily work on naive columns in the
> way that a user expects (and yet these only require field extraction).
>
> So, my particular resolution (what I am arguing for), is that arrow
> libraries that perform field extraction should return an error when
> presented with a timestamp that does not have a timezone.
>
> On Mon, Jun 14, 2021 at 4:45 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > >
> > > I will have to beg you all to give me some time to review all the
> > > information when I am not on vacation, but Arrow has in essence two
> > > timestamp data types
> >
> >
> > This is how I always interpreted with and without timezone (really two
> > distinct types).  I also thought we had covered this on the prior thread
> > with Julian, but I guess we never reached consensus.  I think if we had to
> > do it over again, perhaps different modelling would have made this clearer
> > (e.g. storing separate fields (year, month, day, etc.) for naive timestamps).
> >
> > > If the timezone field doesn't say how to display data to the user, and we
> > > agree it doesn't describe how data is stored (since its very presence means
> > > data is stored as UTC) ... well ... what *is* the meaning of the timezone
> > > field?
> >
> >
> > I think it just so happens this is mostly a hold-over from Pandas and for
> > some reason not everyone in the community looked too closely at it.  A long
> > time ago there was a thread of possibly introducing a per slot/cell
> > timezone like type as well but there hasn't seemed to be a lot of interest.
> >
> > (In my opinion, there shouldn't be a field at all.)
> >
> > This really isn't an option at this point due to compatibility guarantees,
> > at best we could discourage use.
> >
> > On Mon, Jun 14, 2021 at 1:38 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > On Mon, Jun 14, 2021 at 2:47 PM Andrew Lamb <al...@influxdata.com> wrote:
> > > >
> > > > I think the world is headed towards using canonical UTC timestamps (some
> > > > period of time from the unix epoch at UTC) so that timestamps can be
> > > > interpreted as absolute times without requiring additional metadata.
> > > > This is not yet universal, but it seems to me to be where things are
> > > > heading (and what new systems do)
> > > >
> > > > This feels like the same basic transition the world did from "strings"
> > > > with encodings as metadata which can be messed up to "UTF-8"
> > > >
> > > > Thus, I prefer Antoine's interpretation that timestamp values are always
> > > > relative to UTC and the timezone metadata can be used to render them to
> > > > local times if desired
> > >
> > > I will have to beg you all to give me some time to review all the
> > > information when I am not on vacation, but Arrow has in essence two
> > > timestamp data types:
> > >
> > > TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> > > field is not set. We have stated that we want systems to use
> > > system-locale-independent choices for functions that act on this data
> > > (like stringification or field extraction)
> > >
> > > TIMESTAMP WITH TIME ZONE: the time zone field is set. The storage is
> > > UTC-normalized, and time zone changes are metadata only operations.
> > >
> > > > Localization is the action that converts the first type to the
> > > > second type.
> > >
> > > It sounds to me like it is being proposed to eliminate the first of
> > > these two data types. I understand the principles that might motivate
> > > that, but I don't think that is something we can do at this time lest
> > > we lose the ability to have high-fidelity interoperability with other
> > > systems. A system that uses Arrow is certainly free to exclusively use
> > > TIMESTAMP WITH TIME ZONE in its implementation (and ensure that the
> > > time zone field is always set to UTC or another non-UTC time zone).
> > >
> > > > Andrew
> > > >
> > > > On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > > > So it's wrong to put "timezone=UTC", because in Arrow, the "timezone"
> > > > > > field means, "how the data is *displayed*." The data isn't displayed as
> > > > > > UTC.
> > > > >
> > > > > I don't think users will generally be using Arrow to format timestamps
> > > > > for display to the user.  However, if it is, the correct thing to do
> > > > > here would be to store and transport timestamps with timezone=UTC.
> > > > > Then, when it comes time to display, first convert to timezone=local
> > > > > and then convert to string.
> > > > >
> > > > > > If you parse a timestamp string, then you can extract all of the fields
> > > > > > (including hour and day) from the resulting int64 values and they will be
> > > > > > the same as they appeared in the strings. Many users never need to worry
> > > > > > about time zone in their analyses.
> > > > >
> > > > > If Arrow required a timezone (didn't allow naive timestamps) then
> > > > > users that truly don't care about the timezone could simply specify
> > > > > UTC or any other timezone and carry on with their analysis.
> > > > >
> > > > > Personally, I think the safest option would be to only allow
> > > > > > timestamps to be stored with a timezone.  I'd agree with Antoine's
> > > > > earlier point and say that the timezone should always be supplied at
> > > > > the boundary.  However, it may be too late for that.
> > > > >
> > > > > Given that we have "timestamps without timezone" it seems to me the
> > > > > safest thing is to consider them as naive and fail any function that
> > > > > > requires the time zone.  So +1 for Joris' interpretation.  Yes, this
> > > > > renders them useless for any purpose other than pass-through.  If a
> > > > > user truly wants to do something with them then it seems the burden
> > > > > should be on the user to supply a timezone and not for Arrow to infer
> > > > > anything.
> > > > >
> > > > > -Weston
> > > > >
> > > > >
> > > > > On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
> > > > > <jorisvandenboss...@gmail.com> wrote:
> > > > > >
> > > > > > > On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <anto...@python.org> wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > Joris' interpretation is that timestamp *values* are expressed in an
> > > > > > > > arbitrary "local time" that is unknown and unspecified. It is therefore
> > > > > > > > difficult to exactly interpret them, since the timezone information is
> > > > > > > > unavailable.
> > > > > > > >
> > > > > > > > (I'll let Joris express his thoughts more accurately, but the gist of
> > > > > > > > his opinion is that "can be thought of as UTC" is only an indication,
> > > > > > > > not a prescription)
> > > > > >
> > > > > > > That's indeed correct. One clarification: you can interpret them as
> > > > > > > is, and for many applications this is fine. It's only when you want to
> > > > > > > interpret them as an absolute point in time that the user needs to
> > > > > > > supply a timezone to interpret them.
> > > > > >
> > > > > > > For the rest, Wes' responses already cover my viewpoint (as a pandas
> > > > > > > maintainer, I of course have a similar perspective on this looking at
> > > > > > > this from the pandas implementation he wrote).
> > > > > >
> > > > > > > An additional source that explains the "local semantics" of naive
> > > > > > > timestamps well IMO, and especially explains the "can be thought of as
> > > > > > > UTC without being UTC" aspect, is the parquet format docs:
> > > > > > >
> > > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > > > > > > (it's of course about Parquet and not Arrow, but the explanation is
> > > > > > > relevant for the Arrow spec as well).
> > > > > >
> > > > > > Joris
> > > > >
> > >
