Likewise, I took a look and the proposal also seems quite reasonable. I
encourage others to take a look too

On Thu, Nov 20, 2025 at 10:35 AM Antoine Pitrou <[email protected]> wrote:

>
> Hello all,
>
> This proposal looks fine to me on the principle. I've posted a couple
> comments on the format PR.
>
> Regards
>
> Antoine.
>
>
> Le 30/10/2025 à 22:02, serramatutu a écrit :
> > Hello everyone!
> >
> > We (me and @felipecrv) would like to propose a new canonical extension
> type: "TimestampWithOffset". Before we start an official voting, we would
> like to discuss our proposal in this thread.
> >
> > A draft of the format documentation change can be found at [1]. A copy
> of its text is attached under the FORMAT section.
> > A draft Go implementation can be found at [2].
> > A draft Rust implementation can be found at [3].
> >
> >
> > THE PROBLEM
> > ---
> > "TIMESTAMP WITH TIME ZONE" is a standard ANSI SQL type that associates a
> timezone offset to each timestamp entry in a database. Many database
> systems support this data type (some use aliases). Some examples are
> Snowflake [4], Trino [5], Oracle DB [6] and MS SQL Server [7].
> >
> >
> > The current set of Arrow types can only keep one timezone that applies
> to the entire column. This limits the expressiveness of data when
> interacting with such SQL databases. Consumer systems currently need to
> either convert from source "(timestamp, timezone_offset)" to a normalized
> arrow UTC timestamp and throw away the original time zone information, or
> use bespoke formats if the time zone needs to be preserved. For example,
> the ADBC implementation for Snowflake currently implements the former [8].
> >
> > Dropping the time zone has correctness implications for some
> applications. Consider a global business which performs monthly reporting.
> Each business unit is located in a different continent, and processes a
> number of orders every month. Each order is placed in an "orders" fact
> table, where "ordered_at" is a "TIMESTAMP WITH TIME ZONE" column. Now,
> assume there is an order that was placed at 23:00 (11pm) in California
> (UTC-8), on the 31st of January. A system that simply casts this timestamp
> to UTC and drops the time zone will think this order was placed in
> February, which is an incorrect assumption given the business' reporting
> needs of associating orders to the business unit where it was processed.
> There is no way to fetch the original time zone of the order, and using the
> local client time zone does not help either. And so one cannot generate a
> correct report without workarounds to preserve the time zone.
> >
> > We think Arrow should have a canonical way representation for this.
> >
> >
> > THE PROPOSAL
> > ---
> > We propose "TimestampWithOffset" to be a new canonical extension type
> that stores timestamps as "struct(timestamp=timestamp[time_unit=any,
> timezone=utc], offset_minutes=int16)", such that timestamps can have a
> per-row timezone offset instead of having one global timezone attached to
> the entire column.
> >
> >
> > CONSIDERATIONS AND LIMITATIONS
> > ---
> > 1. Why use a 16-bit integer offset in minutes?
> > In ANSI SQL, the time zone information is defined in terms of an
> "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to
> "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest
> granularity with which you can represent a time zone offset, and the
> maximum minutes in the offset is 13*60=780, we believe it makes sense for
> the offset to be stored as a 16-bit integer in minutes. Nonetheless,
> 16-bits is large enough to fit a much wider offset
> >
> > It is important to point out that some systems such as MS SQL Server do
> implement data types that can represent offsets with sub-minute
> granularity. We believe representing sub-minute granularity is out of scope
> for this proposal given that no current or past time zone standards have
> ever specified sub-minute offsets [9], and that is what we're trying to
> solve for. Furthermore, representing the offset in seconds rather than
> minutes would mean the maximum offset is 13*60*60=46800, which is greater
> than the maximum positive integer an int16 can represent (32768), and thus
> the offset type would need to be wider (int32).
> >
> >
> > 2. This type can still be "lossy"
> > Systems like Trino and Oracle DB store the time zone information as an
> IANA time zone name, not as an "HOUR TO MINUTE" interval as specified by
> the ANSI SQL standard. This means the source system (or the arrow
> compatibility layer, such as ADBC) needs to cast time zone strings to the
> offset in minutes. In other words, the integer offset is calculated at the
> source, not at the consumer.
> >
> > This means that the consumer cannot render the original IANA time zone
> string, and needs to use something like "UTC-03:00" instead of
> "America/Sao_Paulo", for example. It's impossible for the consumer to
> lookup the IANA time zone with only the offset, as that is a one-to-many
> mapping, and so the type is lossy with respect to the original IANA time
> zone.
> >
> > There is an upside to this, which is reducing consumer complexity. It
> only needs to add an offset in minutes to the UTC date to get the original
> date in its time zone, without any access to the IANA time zone database
> nor performing complicated conversions, like reasoning about
> variable-offset time zones (e.g daylight savings).
> >
> > 3. JSON representation
> > We propose that the de/serialization to/from JSON must use RFC3339
> strings [10], without loss of information. RFC3339 is a widely accepted
> format across programming languages and databases, and we argue encoding
> "TimestampWithOffset" with it would make JSON integration with external
> non-Arrow systems easier. It enables JSON consumers to decode to their own
> timezone-aware representation of timestamps (like Go's "time.Time" or
> JavaScript's "Date") by leveraging existing RFC3339 de/encoders without
> having to implement wrapper boilerplate just for Arrow.
> >
> >
> > RELEVANT LINKS
> > ---
> > [1] Format specification pull request.
> https://github.com/apache/arrow/pull/48002
> > [2] Golang implementation draft.
> https://github.com/apache/arrow-go/pull/558
> > [3] Rust implementation draft.
> https://github.com/apache/arrow-rs/pull/8743
> > [4] Snowflake's TIMESTAMP_TZ.
> https://docs.snowflake.com/en/sql-reference/data-types-datetime#timestamp-ltz-timestamp-ntz-timestamp-tz
> > [5] Trino's TIMESTAMP WITH TIME ZONE.
> https://trino.io/docs/current/language/types.html#timestamp-p-with-time-zone
> > [6] Oracle's TIMESTAMP WITH TIME ZONE.
> https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/datetime-data-types-and-time-zone-support.html
> > [7] MS SQL Server's DATETIMEOFFSET.
> https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetimeoffset-transact-sql?view=sql-server-ver17
> > [8] ADBC for Snowflake converts to UTC and drops time zone.
> https://github.com/apache/arrow-adbc/blob/a67ab5a509676feaec8e24dba479d4de8dc083e2/go/adbc/driver/snowflake/record_reader.go#L228
> > [9] Current time zones in effect.
> https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
> > [10] RFC3339 representation of timezone-aware timestamps.
> https://datatracker.ietf.org/doc/html/rfc3339
> >
> >
> > FORMAT
> > ---
> > Timestamp With Offset
> > =============
> > This type represents a timestamp column that stores potentially
> different timezone offsets per value. The timestamp is stored in UTC
> alongside the original timezone offset in minutes.
> >
> > * Extension name: ``arrow.timestamp_with_offset``.
> >
> > * The storage type of the extension is a ``Struct`` with 2 fields, in
> order:
> >
> >    * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``,
> where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
> >
> >    * ``offset_minutes``: a non-nullable signed 16-bit integer
> (``Int16``) representing the offset in minutes from the UTC timezone.
> Negative offsets represent time zones west of UTC, while positive offsets
> represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
> >
> > * Extension type parameters:
> >
> >    * ``time_unit``: the time-unit of each of the stored UTC timestamps.
> >
> > * Description of the serialization:
> >
> >    Extension metadata is an empty string.
> >
> >    When de/serializing to/from JSON, this type must be represented as an
> RFC3339 string, respecting the ``TimeUnit`` precision and time zone offset
> without loss of information. For example ``2025-01-01T00:00:00Z``
> represents January 1st 2025 in UTC with second precision, and
> ``2025-01-01T00:00:00.000000001-07:00`` represents one nanosecond after
> January 1st 2025 in UTC-07.
>
>

Reply via email to