On Mon, Jun 14, 2021 at 12:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
> Hi Antoine -- when there is no time zone specified, I do not think it is
> appropriate to consider the data to refer to a specific moment in time
> without applying an explicit time zone localization
>

I have a case for treating "naive" as "UTC": in my field (web development), "naive" is the only spec-compliant way to encode timestamps.

As a canonical example, consider a web-traffic log-analysis application. Web servers record all events with UTC timestamps. But Grafana (or whatever) won't *display* in UTC: it'll display the timezone of the viewer looking at the page. So it's wrong to put "timezone=UTC", because in Arrow, the "timezone" field means, "how the data is *displayed*." The data isn't displayed as UTC. And if "naive = unknown moment in time", then it's also wrong to store these log timestamps as timezone-naive: the moments in time are well-known.

Summed up: as a web developer, *I omit Arrow's "timezone" metadata for my UTC timestamps because different users and components of my system use different time zones*. Arrow's "timezone" metadata is not the place for me. I need Arrow to provide a timezone-agnostic UTC timestamp -- like MySQL's TIMESTAMP.

> When localizing data (adding a time zone when there was none previously), I
> do not think we can assume that the data is already localized to UTC. I
> provided a gist showing the behavior of the pandas tz_localize function --
> the int64 values must each be shifted by the UTC offset at that moment.
> That’s what I think we have to do in this project.

The question is: what should we call an int64-encoded datetime? "timestamp" or "int64"?

I think "int64", because *most programming languages and libraries agree timestamps are UTC*. They use different, struct-based types for datetimes. Languages Arrow cares about:

* C and C-built languages like Python, Ruby, etc. store 64-bit integers time_t as UTC. They have struct tm or similar tuples for date+time (+ sometimes timezone).
* Go and Rust timestamps are UTC integers; datetimes are structs.
* Java Instant stores a UTC integer. Its LocalDateTime is a struct holding LocalDate + LocalTime fields.
* Julia: I don't know.
* JavaScript Date stores a UTC integer; its best approximation of datetime is ISO8601-formatted strings.
* Numpy is the black sheep: its datetime64 is *always* an int64-encoded datetime, *never* a UTC timestamp. But Pandas Timestamp, built atop it, is always a UTC timestamp.

It's nearly unanimous: in almost every Arrow-supported language, any developer who happens upon an int64 time-related value can assume it's UTC. Numpy stands out as the lone exception.

I think it would be confusing for Arrow "timestamp" columns to allow int64-encoded datetimes, because that pattern is a wide deviation from the norm. I believe more Arrow users would get more done more quickly if int64-encoded datetimes were documented as a pattern one can use with "int64" columns, not a pattern one can use with "timestamp" columns.

> If you know that the
> data is UTC, then the correct action is to call tz_localize('UTC') and then
> tz_convert(tz) where tz is the intended time zone (which is only a
> modification to the type metadata). My interpretation is certainly colored
> by the experience of designing this functionality in pandas, but after 10
> years of observing real world use this model seems to work well and not
> trip people up too much.
>

*I'll raise my hand here*: I got tripped up last week. I interpreted tz_localize() to do literally the opposite of what it does, after reading and re-reading the doc, and posted a misleading comment in a JIRA ticket. I've been using Pandas and training users for five years.

That isn't the worst: for the five years before that, I misinterpreted Postgres's TIMESTAMP WITHOUT TIME ZONE to be UTC. Again, I was the opposite of correct. (TIMESTAMP WITHOUT TIME ZONE is an int64-encoded datetime; the way to store UTC timestamps is TIMESTAMP WITH TIME ZONE, which doesn't store a time zone.)

I'm a smart person.
I keep making these embarrassing -- and costly -- mistakes.

I've never been tripped up by java.time.Instant. It's no wonder Java embraced it. I hope Arrow empowers its community to make tools that make me feel not-stupid.

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
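
P.S. To make the two int64 patterns concrete, here is a minimal sketch using only Python's standard library. It is not pandas or Arrow code: the helper names (naive_to_int64, localize) are made up for illustration, and the fixed UTC-4 offset is an assumption standing in for a real time zone database lookup. It shows an "int64-encoded datetime" (wall-clock fields packed into an integer), the shift that localizing it to a non-UTC zone requires, and the fact that displaying a UTC timestamp in another zone leaves the integer alone.

```python
from datetime import datetime, timedelta, timezone

EPOCH_NAIVE = datetime(1970, 1, 1)  # naive epoch: no zone implied


def naive_to_int64(dt):
    """Pack a naive datetime's wall-clock fields into seconds since 1970-01-01.

    The result is an "int64-encoded datetime": it names no moment in time.
    """
    return int((dt - EPOCH_NAIVE).total_seconds())


def localize(wall_seconds, tz):
    """Treat wall_seconds as local wall-clock time in tz; return UTC epoch seconds.

    Hypothetical helper: the integer moves by the UTC offset in effect,
    which is the shift described for localizing naive values.
    """
    wall = EPOCH_NAIVE + timedelta(seconds=wall_seconds)
    return int(wall.replace(tzinfo=tz).timestamp())


utc = timezone.utc
edt = timezone(timedelta(hours=-4))  # assumed fixed offset, for illustration only

wall = naive_to_int64(datetime(2021, 6, 14, 12, 0, 0))

as_utc = localize(wall, utc)  # integer unchanged: the naive value already was UTC
as_edt = localize(wall, edt)  # integer shifted by the four-hour offset

# Converting for display changes only how the integer is rendered.
shown_in_utc = datetime.fromtimestamp(as_utc, tz=utc)
shown_in_edt = datetime.fromtimestamp(as_utc, tz=edt)

print(as_utc - wall, as_edt - wall)
print(shown_in_utc.isoformat(), shown_in_edt.isoformat())
```

If the naive values really are UTC (my web-log case), localizing to UTC is a no-op on the integers and every later conversion is display-only. If they are wall-clock values from some other zone, every integer has to move.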