Hello everyone!

We (me and @felipecrv) would like to propose a new canonical extension type: 
"TimestampWithOffset". Before we start an official voting, we would like to 
discuss our proposal in this thread.

A draft of the format documentation change can be found at [1]. A copy of its 
text is attached under the FORMAT section.
A draft Go implementation can be found at [2].
A draft Rust implementation can be found at [3].


THE PROBLEM
---
"TIMESTAMP WITH TIME ZONE" is a standard ANSI SQL type that associates a 
timezone offset to each timestamp entry in a database. Many database systems 
support this data type (some use aliases). Some examples are Snowflake [4], 
Trino [5], Oracle DB [6] and MS SQL Server [7]. 


The current set of Arrow types can only keep one timezone that applies to the 
entire column. This limits the expressiveness of data when interacting with 
such SQL databases. Consumer systems currently need to either convert from 
source "(timestamp, timezone_offset)" to a normalized arrow UTC timestamp and 
throw away the original time zone information, or use bespoke formats if the 
time zone needs to be preserved. For example, the ADBC implementation for 
Snowflake currently implements the former [8].

Dropping the time zone has correctness implications for some applications. 
Consider a global business which performs monthly reporting. Each business unit 
is located in a different continent, and processes a number of orders every 
month. Each order is placed in an "orders" fact table, where "ordered_at" is a 
"TIMESTAMP WITH TIME ZONE" column. Now, assume there is an order that was 
placed at 23:00 (11pm) in California (UTC-8), on the 31st of January. A system 
that simply casts this timestamp to UTC and drops the time zone will think this 
order was placed in February, which is an incorrect assumption given the 
business' reporting needs of associating orders to the business unit where it 
was processed. There is no way to fetch the original time zone of the order, 
and using the local client time zone does not help either. And so one cannot 
generate a correct report without workarounds to preserve the time zone.

We think Arrow should have a canonical way representation for this.


THE PROPOSAL
---
We propose "TimestampWithOffset" to be a new canonical extension type that 
stores timestamps as "struct(timestamp=timestamp[time_unit=any, timezone=utc], 
offset_minutes=int16)", such that timestamps can have a per-row timezone offset 
instead of having one global timezone attached to the entire column.


CONSIDERATIONS AND LIMITATIONS
---
1. Why use a 16-bit integer offset in minutes?
In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" 
offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' 
HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can 
represent a time zone offset, and the maximum minutes in the offset is 
13*60=780, we believe it makes sense for the offset to be stored as a 16-bit 
integer in minutes. Nonetheless, 16-bits is large enough to fit a much wider 
offset

It is important to point out that some systems such as MS SQL Server do 
implement data types that can represent offsets with sub-minute granularity. We 
believe representing sub-minute granularity is out of scope for this proposal 
given that no current or past time zone standards have ever specified 
sub-minute offsets [9], and that is what we're trying to solve for. 
Furthermore, representing the offset in seconds rather than minutes would mean 
the maximum offset is 13*60*60=46800, which is greater than the maximum 
positive integer an int16 can represent (32768), and thus the offset type would 
need to be wider (int32).


2. This type can still be "lossy"
Systems like Trino and Oracle DB store the time zone information as an IANA 
time zone name, not as an "HOUR TO MINUTE" interval as specified by the ANSI 
SQL standard. This means the source system (or the arrow compatibility layer, 
such as ADBC) needs to cast time zone strings to the offset in minutes. In 
other words, the integer offset is calculated at the source, not at the 
consumer.

This means that the consumer cannot render the original IANA time zone string, 
and needs to use something like "UTC-03:00" instead of "America/Sao_Paulo", for 
example. It's impossible for the consumer to lookup the IANA time zone with 
only the offset, as that is a one-to-many mapping, and so the type is lossy 
with respect to the original IANA time zone.

There is an upside to this, which is reducing consumer complexity. It only 
needs to add an offset in minutes to the UTC date to get the original date in 
its time zone, without any access to the IANA time zone database nor performing 
complicated conversions, like reasoning about variable-offset time zones (e.g 
daylight savings).

3. JSON representation
We propose that the de/serialization to/from JSON must use RFC3339 strings 
[10], without loss of information. RFC3339 is a widely accepted format across 
programming languages and databases, and we argue encoding 
"TimestampWithOffset" with it would make JSON integration with external 
non-Arrow systems easier. It enables JSON consumers to decode to their own 
timezone-aware representation of timestamps (like Go's "time.Time" or 
JavaScript's "Date") by leveraging existing RFC3339 de/encoders without having 
to implement wrapper boilerplate just for Arrow.


RELEVANT LINKS
---
[1] Format specification pull request. 
https://github.com/apache/arrow/pull/48002
[2] Golang implementation draft. https://github.com/apache/arrow-go/pull/558
[3] Rust implementation draft. https://github.com/apache/arrow-rs/pull/8743
[4] Snowflake's TIMESTAMP_TZ. 
https://docs.snowflake.com/en/sql-reference/data-types-datetime#timestamp-ltz-timestamp-ntz-timestamp-tz
[5] Trino's TIMESTAMP WITH TIME ZONE. 
https://trino.io/docs/current/language/types.html#timestamp-p-with-time-zone
[6] Oracle's TIMESTAMP WITH TIME ZONE. 
https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/datetime-data-types-and-time-zone-support.html
[7] MS SQL Server's DATETIMEOFFSET. 
https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetimeoffset-transact-sql?view=sql-server-ver17
[8] ADBC for Snowflake converts to UTC and drops time zone. 
https://github.com/apache/arrow-adbc/blob/a67ab5a509676feaec8e24dba479d4de8dc083e2/go/adbc/driver/snowflake/record_reader.go#L228
[9] Current time zones in effect. 
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
[10] RFC3339 representation of timezone-aware timestamps. 
https://datatracker.ietf.org/doc/html/rfc3339


FORMAT
---
Timestamp With Offset
=============
This type represents a timestamp column that stores potentially different 
timezone offsets per value. The timestamp is stored in UTC alongside the 
original timezone offset in minutes.

* Extension name: ``arrow.timestamp_with_offset``.

* The storage type of the extension is a ``Struct`` with 2 fields, in order:

  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where 
``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) 
representing the offset in minutes from the UTC timezone. Negative offsets 
represent time zones west of UTC, while positive offsets represent east. 
Offsets range from -779 (-12:59) to +780 (+13:00).

* Extension type parameters:

  * ``time_unit``: the time-unit of each of the stored UTC timestamps.

* Description of the serialization:

  Extension metadata is an empty string.

  When de/serializing to/from JSON, this type must be represented as an RFC3339 
string, respecting the ``TimeUnit`` precision and time zone offset without loss 
of information. For example ``2025-01-01T00:00:00Z`` represents January 1st 
2025 in UTC with second precision, and ``2025-01-01T00:00:00.000000001-07:00`` 
represents one nanosecond after January 1st 2025 in UTC-07.

Attachment: publickey - [email protected] - 0x0A7793AD.asc
Description: application/pgp-keys

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to