On Mon, Jun 14, 2021 at 2:47 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> I think the world is headed towards using canonical UTC timestamps (some
> period of time from the unix epoch at UTC) so that timestamps can be
> interpreted as absolute times without requiring additional metadata. This
> is not yet universal, but it seems to me to be where things are heading
> (and what new systems do)
>
> This feels like the same basic transition the world made from "strings" with
> encodings as metadata, which can be messed up, to "UTF-8"
>
> Thus, I prefer Antoine's interpretation that timestamp values are always
> relative to UTC and the timezone metadata can be used to render them to
> local times if desired

I will have to beg you all to give me some time to review all the
information when I am not on vacation, but Arrow has in essence two
timestamp data types:

TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
field is not set. We have stated that we want systems to use
system-locale-independent choices for functions that act on this data
(like stringification or field extraction)

TIMESTAMP WITH TIME ZONE: the time zone field is set. The storage is
UTC-normalized, and time zone changes are metadata only operations.

Localization is the action that converts from the first type to the
second type.
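
To make the distinction concrete, here is a minimal sketch in plain
Python (not Arrow code). It assumes second resolution and a fixed
UTC-05:00 offset purely for illustration; the helper names
wall_clock_to_int, localize, and render are hypothetical and are not
part of any Arrow API.

```python
from datetime import datetime, timedelta, timezone

EPOCH_NAIVE = datetime(1970, 1, 1)

def wall_clock_to_int(wall: datetime) -> int:
    """Encode naive wall-clock fields as seconds, as if the fields were UTC."""
    return int((wall - EPOCH_NAIVE).total_seconds())

def localize(naive_value: int, tz: timezone) -> int:
    """WITHOUT TIME ZONE -> WITH TIME ZONE: read the wall-clock fields in tz
    and return the UTC-normalized value. The stored integer changes."""
    wall = EPOCH_NAIVE + timedelta(seconds=naive_value)
    return int(wall.replace(tzinfo=tz).timestamp())

def render(utc_value: int, tz: timezone) -> str:
    """Changing the time zone of UTC-normalized data only changes rendering."""
    return datetime.fromtimestamp(utc_value, tz).isoformat()

naive = wall_clock_to_int(datetime(2021, 6, 14, 14, 47))
minus_five = timezone(timedelta(hours=-5))  # assumed fixed offset
stored = localize(naive, minus_five)

print(stored - naive)              # offset applied by localization, in seconds
print(render(stored, minus_five))  # same stored value, rendered at -05:00
print(render(stored, timezone.utc))  # same stored value, rendered in UTC
```

Note that render never touches the stored integer, which is the
"metadata only" property, while localize does.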

It sounds to me like it is being proposed to eliminate the first of
these two data types. I understand the principles that might motivate
that, but I don't think that is something we can do at this time lest
we lose the ability to have high-fidelity interoperability with other
systems. A system that uses Arrow is certainly free to exclusively use
TIMESTAMP WITH TIME ZONE in its implementation (and ensure that the
time zone field is always set, whether to UTC or to another time zone).
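
As a rough sketch of what such a system could do at its boundary, the
following plain-Python check rejects fields whose time zone is unset.
TimestampType and require_time_zone are hypothetical stand-ins made up
for this illustration, not Arrow classes or functions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TimestampType:
    """Hypothetical stand-in for an Arrow timestamp type's metadata."""
    unit: str                 # e.g. "s", "ms", "us", "ns"
    tz: Optional[str] = None  # None models TIMESTAMP WITHOUT TIME ZONE

def require_time_zone(field_name: str, ts_type: TimestampType) -> TimestampType:
    """Reject naive timestamp fields at the system boundary."""
    if not ts_type.tz:
        raise ValueError(
            f"field {field_name!r} has no time zone; "
            "localize it before handing it to this system"
        )
    return ts_type

# Accepted: the time zone field is set.
require_time_zone("event_time", TimestampType("us", "UTC"))

# Rejected: the time zone field is not set.
try:
    require_time_zone("event_time", TimestampType("us"))
except ValueError as err:
    print(err)
```

The format itself keeps both types; only this one system chooses to
refuse the first.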

> Andrew
>
> On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > > So it's wrong to put "timezone=UTC", because in Arrow, the "timezone"
> > > field means, "how the data is *displayed*." The data isn't displayed
> > > as UTC.
> >
> > I don't think users will generally be using Arrow to format timestamps
> > for display to the user.  However, if they are, the correct thing to do
> > here would be to store and transport timestamps with timezone=UTC.
> > Then, when it comes time to display, first convert to timezone=local
> > and then convert to string.
> >
> > > If you parse a timestamp string, then you can extract all of the fields
> > > (including hour and day) from the resulting int64 values and they will be
> > > the same as they appeared in the strings. Many users never need to worry
> > > about time zones in their analyses.
> >
> > If Arrow required a timezone (didn't allow naive timestamps) then
> > users that truly don't care about the timezone could simply specify
> > UTC or any other timezone and carry on with their analysis.
> >
> > Personally, I think the safest option would be to only allow
> > timestamps to be stored with a timezone.   I'd agree with Antoine's
> > earlier point and say that the timezone should always be supplied at
> > the boundary.  However, it may be too late for that.
> >
> > Given that we have "timestamps without timezone" it seems to me the
> > safest thing is to consider them as naive and fail any function that
> > requires the time zone.  So +1 for Joris' interpretation.  Yes, this
> > renders them useless for any purpose other than pass-through.  If a
> > user truly wants to do something with them then it seems the burden
> > should be on the user to supply a timezone and not for Arrow to infer
> > anything.
> >
> > -Weston
> >
> >
> > On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
> > <jorisvandenboss...@gmail.com> wrote:
> > >
> > > On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > ...
> > > >
> > > > Joris' interpretation is that timestamp *values* are expressed in an
> > > > arbitrary "local time" that is unknown and unspecified. It is therefore
> > > > difficult to exactly interpret them, since the timezone information is
> > > > unavailable.
> > > >
> > > > (I'll let Joris express his thoughts more accurately, but the gist of
> > > > his opinion is that "can be thought of as UTC" is only an indication,
> > > > not a prescription)
> > >
> > > That's indeed correct. One clarification: you can interpret them as
> > > is, and for many applications this is fine. It's only when you want to
> > > interpret them as an absolute point in time that the user needs to
> > > supply a timezone to interpret them.
> > >
> > > For the rest, Wes' responses already cover my viewpoint (as a pandas
> > > maintainer, I of course have a similar perspective, since I am looking
> > > at this from the pandas implementation he wrote).
> > >
> > > An additional source that explains the "local semantics" of naive
> > > timestamps well IMO, and especially explains the "can be thought of as
> > > UTC without being UTC" aspect, is the parquet format docs:
> > >
> > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > > (it's of course about Parquet and not Arrow, but the explanation is
> > > relevant for the Arrow spec as well).
> > >
> > > Joris
> >
