Generally, seems like a good idea two questions: 1. I didn't think we generally specify JSON serialization formats for types? 2. I assume if there is a high enough demand we can always have a different variant that uses IANA strings? 3. Will dictionary encoding of the time zone component be allowed (for common cases I would expect one could save 1 byte but this is perhaps over optimization).
Thanks, Micah On Tue, Nov 4, 2025 at 5:25 AM Weston Pace <[email protected]> wrote: > +1 > > This has always felt like a missing option and hopefully its presence will > actually make the intent of the other timestamp types clearer. > > On Thu, Oct 30, 2025 at 2:04 PM serramatutu <[email protected]> > wrote: > > > Hello everyone! > > > > We (me and @felipecrv) would like to propose a new canonical extension > > type: "TimestampWithOffset". Before we start an official voting, we would > > like to discuss our proposal in this thread. > > > > A draft of the format documentation change can be found at [1]. A copy of > > its text is attached under the FORMAT section. > > A draft Go implementation can be found at [2]. > > A draft Rust implementation can be found at [3]. > > > > > > THE PROBLEM > > --- > > "TIMESTAMP WITH TIME ZONE" is a standard ANSI SQL type that associates a > > timezone offset to each timestamp entry in a database. Many database > > systems support this data type (some use aliases). Some examples are > > Snowflake [4], Trino [5], Oracle DB [6] and MS SQL Server [7]. > > > > > > The current set of Arrow types can only keep one timezone that applies to > > the entire column. This limits the expressiveness of data when > interacting > > with such SQL databases. Consumer systems currently need to either > convert > > from source "(timestamp, timezone_offset)" to a normalized arrow UTC > > timestamp and throw away the original time zone information, or use > bespoke > > formats if the time zone needs to be preserved. For example, the ADBC > > implementation for Snowflake currently implements the former [8]. > > > > Dropping the time zone has correctness implications for some > applications. > > Consider a global business which performs monthly reporting. Each > business > > unit is located in a different continent, and processes a number of > orders > > every month. Each order is placed in an "orders" fact table, where > > "ordered_at" is a "TIMESTAMP WITH TIME ZONE" column. Now, assume there is > > an order that was placed at 23:00 (11pm) in California (UTC-8), on the > 31st > > of January. A system that simply casts this timestamp to UTC and drops > the > > time zone will think this order was placed in February, which is an > > incorrect assumption given the business' reporting needs of associating > > orders to the business unit where it was processed. There is no way to > > fetch the original time zone of the order, and using the local client > time > > zone does not help either. And so one cannot generate a correct report > > without workarounds to preserve the time zone. > > > > We think Arrow should have a canonical way representation for this. > > > > > > THE PROPOSAL > > --- > > We propose "TimestampWithOffset" to be a new canonical extension type > that > > stores timestamps as "struct(timestamp=timestamp[time_unit=any, > > timezone=utc], offset_minutes=int16)", such that timestamps can have a > > per-row timezone offset instead of having one global timezone attached to > > the entire column. > > > > > > CONSIDERATIONS AND LIMITATIONS > > --- > > 1. Why use a 16-bit integer offset in minutes? > > In ANSI SQL, the time zone information is defined in terms of an > > "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to > > "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest > > granularity with which you can represent a time zone offset, and the > > maximum minutes in the offset is 13*60=780, we believe it makes sense for > > the offset to be stored as a 16-bit integer in minutes. Nonetheless, > > 16-bits is large enough to fit a much wider offset > > > > It is important to point out that some systems such as MS SQL Server do > > implement data types that can represent offsets with sub-minute > > granularity. We believe representing sub-minute granularity is out of > scope > > for this proposal given that no current or past time zone standards have > > ever specified sub-minute offsets [9], and that is what we're trying to > > solve for. Furthermore, representing the offset in seconds rather than > > minutes would mean the maximum offset is 13*60*60=46800, which is greater > > than the maximum positive integer an int16 can represent (32768), and > thus > > the offset type would need to be wider (int32). > > > > > > 2. This type can still be "lossy" > > Systems like Trino and Oracle DB store the time zone information as an > > IANA time zone name, not as an "HOUR TO MINUTE" interval as specified by > > the ANSI SQL standard. This means the source system (or the arrow > > compatibility layer, such as ADBC) needs to cast time zone strings to the > > offset in minutes. In other words, the integer offset is calculated at > the > > source, not at the consumer. > > > > This means that the consumer cannot render the original IANA time zone > > string, and needs to use something like "UTC-03:00" instead of > > "America/Sao_Paulo", for example. It's impossible for the consumer to > > lookup the IANA time zone with only the offset, as that is a one-to-many > > mapping, and so the type is lossy with respect to the original IANA time > > zone. > > > > There is an upside to this, which is reducing consumer complexity. It > only > > needs to add an offset in minutes to the UTC date to get the original > date > > in its time zone, without any access to the IANA time zone database nor > > performing complicated conversions, like reasoning about variable-offset > > time zones (e.g daylight savings). > > > > 3. JSON representation > > We propose that the de/serialization to/from JSON must use RFC3339 > strings > > [10], without loss of information. RFC3339 is a widely accepted format > > across programming languages and databases, and we argue encoding > > "TimestampWithOffset" with it would make JSON integration with external > > non-Arrow systems easier. It enables JSON consumers to decode to their > own > > timezone-aware representation of timestamps (like Go's "time.Time" or > > JavaScript's "Date") by leveraging existing RFC3339 de/encoders without > > having to implement wrapper boilerplate just for Arrow. > > > > > > RELEVANT LINKS > > --- > > [1] Format specification pull request. > > https://github.com/apache/arrow/pull/48002 > > [2] Golang implementation draft. > > https://github.com/apache/arrow-go/pull/558 > > [3] Rust implementation draft. > > https://github.com/apache/arrow-rs/pull/8743 > > [4] Snowflake's TIMESTAMP_TZ. > > > https://docs.snowflake.com/en/sql-reference/data-types-datetime#timestamp-ltz-timestamp-ntz-timestamp-tz > > [5] Trino's TIMESTAMP WITH TIME ZONE. > > > https://trino.io/docs/current/language/types.html#timestamp-p-with-time-zone > > [6] Oracle's TIMESTAMP WITH TIME ZONE. > > > https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/datetime-data-types-and-time-zone-support.html > > [7] MS SQL Server's DATETIMEOFFSET. > > > https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetimeoffset-transact-sql?view=sql-server-ver17 > > [8] ADBC for Snowflake converts to UTC and drops time zone. > > > https://github.com/apache/arrow-adbc/blob/a67ab5a509676feaec8e24dba479d4de8dc083e2/go/adbc/driver/snowflake/record_reader.go#L228 > > [9] Current time zones in effect. > > https://en.wikipedia.org/wiki/List_of_tz_database_time_zones > > [10] RFC3339 representation of timezone-aware timestamps. > > https://datatracker.ietf.org/doc/html/rfc3339 > > > > > > FORMAT > > --- > > Timestamp With Offset > > ============= > > This type represents a timestamp column that stores potentially different > > timezone offsets per value. The timestamp is stored in UTC alongside the > > original timezone offset in minutes. > > > > * Extension name: ``arrow.timestamp_with_offset``. > > > > * The storage type of the extension is a ``Struct`` with 2 fields, in > > order: > > > > * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where > > ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). > > > > * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) > > representing the offset in minutes from the UTC timezone. Negative > offsets > > represent time zones west of UTC, while positive offsets represent east. > > Offsets range from -779 (-12:59) to +780 (+13:00). > > > > * Extension type parameters: > > > > * ``time_unit``: the time-unit of each of the stored UTC timestamps. > > > > * Description of the serialization: > > > > Extension metadata is an empty string. > > > > When de/serializing to/from JSON, this type must be represented as an > > RFC3339 string, respecting the ``TimeUnit`` precision and time zone > offset > > without loss of information. For example ``2025-01-01T00:00:00Z`` > > represents January 1st 2025 in UTC with second precision, and > > ``2025-01-01T00:00:00.000000001-07:00`` represents one nanosecond after > > January 1st 2025 in UTC-07. > > >
