On Mon, Jun 14, 2021 at 12:29 PM Wes McKinney <wesmck...@gmail.com> wrote:

> Hi Antoine — when there is no time zone specified, I do not think it is
> appropriate to consider the data to refer to a specific moment in time
> without applying an explicit time zone localization
>

I have a case for treating "naive" as "UTC": in my field (web development),
"naive" is the only spec-compliant way to encode timestamps.

As a canonical example, consider a web-traffic log-analysis application.
Web servers record all events with UTC timestamps. But Grafana (or
whatever) won't *display* in UTC: it'll display the timezone of the viewer
looking at the page.

So it's wrong to put "timezone=UTC", because in Arrow, the "timezone" field
means, "how the data is *displayed*." The data isn't displayed as UTC.

And if "naive = unknown moment in time", then it's also wrong to store
these log timestamps as timezone-naive: the moments in time are well-known.

Summed up: as a web developer, *I omit Arrow's "timezone" metadata for my
UTC timestamps because different users and components of my system use
different time zones*. Arrow's "timezone" metadata is not the place for me.
I need Arrow to provide a timezone-agnostic UTC timestamp -- like MySQL's
TIMESTAMP.

> When localizing data (adding a time zone when there was none previously), I
> do not think we can assume that the data is already localized to UTC. I
> provided a gist showing the behavior of the pandas tz_localize function —
> the int64 values must each be shifted by the UTC offset at that moment.
> That’s what I think we have to do in this project.


The question is: what should we call an int64-encoded datetime? "timestamp"
or "int64"?
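
To make the two encodings concrete, here is a minimal Python sketch using only the standard library. The helper names and the fixed UTC-04:00 offset are my own assumptions for illustration; this is not Arrow's or pandas' implementation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offset (UTC-04:00) standing in for a real named zone.
LOCAL_TZ = timezone(timedelta(hours=-4))


def encode_wall_clock(wall: datetime) -> int:
    """int64-encoded datetime: pack the wall-clock fields as if they were UTC."""
    return int(wall.replace(tzinfo=timezone.utc).timestamp())


def encode_instant(wall: datetime, tz: timezone) -> int:
    """UTC timestamp: interpret the wall-clock fields in tz, count seconds since the epoch."""
    return int(wall.replace(tzinfo=tz).timestamp())


wall = datetime(2021, 6, 14, 12, 29, 0)  # naive: no zone attached

wall_clock_value = encode_wall_clock(wall)
instant_value = encode_instant(wall, LOCAL_TZ)

# The two int64 values differ by exactly the UTC offset at that moment.
shift = instant_value - wall_clock_value
```

Both values are plain integers, and nothing in the integer itself says which of the two meanings applies. That ambiguity is the naming question.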

I think "int64", because *most programming languages and libraries agree
timestamps are UTC*. They use different, struct-based types for datetimes.

Languages Arrow cares about:

* C and C-built languages like Python, Ruby, etc. store 64-bit integer
time_t values as UTC. They have struct tm or similar tuples for date+time (+
sometimes timezone).
* Go and Rust timestamps are UTC integers; datetimes are structs.
* Java Instant stores UTC integer. Its LocalDateTime is a struct holding
LocalDate + LocalTime fields.
* Julia: I don't know.
* JavaScript Date stores UTC integer; its best approximation of datetime is
ISO8601-formatted strings.
* Numpy is the black sheep: its datetime64 is *always* an int64-encoded
datetime, *never* a UTC timestamp. But Pandas Timestamp, built atop it, is
always a UTC timestamp.

It's nearly unanimous: in almost every Arrow-supported language, a developer
who happens upon an int64 time-related value can assume it's UTC. Numpy
stands out as the lone exception.
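
As a small illustration of that convention in Python's standard library (a sketch only; the variable names are mine), the integer is the UTC timestamp and the struct is the broken-down datetime:

```python
import calendar
import time
from datetime import datetime, timezone

epoch_seconds = 0  # integer timestamp: zero seconds after the Unix epoch, in UTC

# Broken-down form, the Python analogue of C's struct tm.
as_struct = time.gmtime(epoch_seconds)

# The same instant as a timezone-aware datetime object.
as_datetime = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# Converting the struct back to an integer assumes its fields are UTC.
round_trip = calendar.timegm(as_struct)
```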

I think it would be confusing for Arrow "timestamp" columns to allow
int64-encoded datetime because that pattern is a wide deviation from the
norm. I believe more Arrow users would get more done more quickly if
int64-encoded datetimes were documented as a pattern one can use with
"int64" columns, not a pattern one can use with "timestamp" columns.


> If you know that the
> data is UTC, then the correct action is to call tz_localize('UTC') and then
> tz_convert(tz) where tz is the intended time zone (which is only a
> modification to the type metadata). My interpretation is certainly colored
> by the experience of designing this functionality in pandas, but after 10
> years of observing real world use this model seems to work well and not
> trip people up too much.
>
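
The two-step pattern quoted above can be sketched with the standard library. This is an analogy, not pandas itself: here `replace(tzinfo=...)` plays the role of tz_localize('UTC') and `astimezone(...)` plays the role of tz_convert(tz). The fixed UTC-04:00 viewer offset is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical viewer zone: a fixed UTC-04:00 offset instead of a named zone.
VIEWER_TZ = timezone(timedelta(hours=-4))

logged = datetime(2021, 6, 14, 16, 29, 0)      # naive fields, known to be UTC
as_utc = logged.replace(tzinfo=timezone.utc)   # attach UTC: the instant is now explicit
for_viewer = as_utc.astimezone(VIEWER_TZ)      # change only how the instant is displayed

# Converting between zones leaves the underlying instant untouched.
same_instant = as_utc.timestamp() == for_viewer.timestamp()
display_hour = for_viewer.hour
```

Only the displayed fields change in the second step; the seconds-since-epoch value stays the same.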

*I'll raise my hand here*: I got tripped up last week. I interpreted
tz_localize() to do literally the opposite of what it does, after reading
and re-reading the doc, and posted a misleading comment in a JIRA ticket.
I've been using Pandas and training users for five years.

That isn't the worst: the prior five years, I misinterpreted Postgres's
TIMESTAMP WITHOUT TIME ZONE to be UTC. Again, I was the opposite of correct.
(TIMESTAMP WITHOUT TIME ZONE is an int64-encoded datetime; the way to store
UTC timestamps is TIMESTAMP WITH TIME ZONE, which doesn't store a time zone.)

I'm a smart person. I keep making these embarrassing -- and costly --
mistakes.

I've never been tripped up by java.time.Instant. It's no wonder Java
embraced it.

I hope Arrow empowers its community to make tools that make me feel
not-stupid.

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
