+1! I can't comment on the timestamp-specific details but have had similar
experiences writing database drivers where Arrow types require a type-level
parameter but the database has a row-level parameter (Postgres numerics
were my last adventure with this one).

> Will dictionary encoding of the time zone component be allowed

Unless there are some concrete performance or memory usage experiments that
suggest concrete benefits, I would probably err on the side of not allowing
this or at least not recommending it. Having a single allowed type is quite
nice (GeoArrow might be an extreme case, but we have many allowed types
there and it's a lot to deal with).

> I didn't think we generally specify JSON serialization formats for types?

It's true that we don't generally specify the cast behaviour to and from
string, although most Arrow implementations have to choose how to display a
value at some point and it might be a nice addition to ensure the value is
displayed consistently.

On Wed, Nov 5, 2025 at 1:47 AM serramatutu <[email protected]>
wrote:

> Hey!
>
> 1. I didn't think we generally specify JSON serialization formats for
> types?
>
> We included this in the proposal mostly because we thought it would be
> nicer for non-Arrow programs to parse a list of RFC3339 JSON like
> `["2025-01-01T00:01:00T+01:00"]` instead of something like `[{"timestamp":
> "2025-01-01T00:00:00Z", "offset": 60}]`. I understand if JSON serialization
> is not something we want to put in the spec. I do want to point out that
> the arrow-go implementation of the UUID extension type does have
> specialized JSON serialization [1], so I thought we could do the same for
> TimestampWithOffset as well. I see that the specialized UUID serialization
> is not a part of the UUID type spec though, so I assume it's something the
> Go implementers thought was best.
>
>
> 2. I assume if there is a high enough demand we can always have a
> different variant that uses IANA strings?
>
> Yes, this is something that we can discuss in the future if folks ask for
> it, either through extending this type somehow or adding a new
> `TimestampWithTimezone`. We thought of keeping things simple for now with
> this proposal, though.
>
>
> 3. Will dictionary encoding of the time zone component be allowed (for
> common cases I would expect one could save 1 byte but this is perhaps over
> optimization).
>
> This is a good point. We could add both dictionary encoding and run-end
> encoding to the offsets. I think we can add this to the spec now to avoid
> future breaking changes to the data types, and keep the implementation PRs
> simple (only primitive encoding) for now. We can later add support for
> other encodings in followup PRs.
>
>
> [1]
> https://github.com/apache/arrow-go/blob/f6575665bbf95d92795c37a1a792682d9da69b08/arrow/extensions/uuid_test.go#L242
>
>
> On Tuesday, November 4th, 2025 at 16:18, Micah Kornfield <
> [email protected]> wrote:
>
> >
>
> >
>
> > Generally, seems like a good idea two questions:
> >
>
> > 1. I didn't think we generally specify JSON serialization formats for
> > types?
> > 2. I assume if there is a high enough demand we can always have a
> > different variant that uses IANA strings?
> > 3. Will dictionary encoding of the time zone component be allowed (for
> > common cases I would expect one could save 1 byte but this is perhaps
> over
> > optimization).
> >
>
> > Thanks,
> > Micah
> >
>
> > On Tue, Nov 4, 2025 at 5:25 AM Weston Pace [email protected] wrote:
> >
>
> > > +1
> > >
>
> > > This has always felt like a missing option and hopefully its presence
> will
> > > actually make the intent of the other timestamp types clearer.
> > >
>
> > > On Thu, Oct 30, 2025 at 2:04 PM serramatutu [email protected]
> > > wrote:
> > >
>
> > > > Hello everyone!
> > > >
>
> > > > We (me and @felipecrv) would like to propose a new canonical
> extension
> > > > type: "TimestampWithOffset". Before we start an official voting, we
> would
> > > > like to discuss our proposal in this thread.
> > > >
>
> > > > A draft of the format documentation change can be found at [1]. A
> copy of
> > > > its text is attached under the FORMAT section.
> > > > A draft Go implementation can be found at [2].
> > > > A draft Rust implementation can be found at [3].
> > > >
>
> > > > THE PROBLEM
> > > > ---
> > > > "TIMESTAMP WITH TIME ZONE" is a standard ANSI SQL type that
> associates a
> > > > timezone offset to each timestamp entry in a database. Many database
> > > > systems support this data type (some use aliases). Some examples are
> > > > Snowflake [4], Trino [5], Oracle DB [6] and MS SQL Server [7].
> > > >
>
> > > > The current set of Arrow types can only keep one timezone that
> applies to
> > > > the entire column. This limits the expressiveness of data when
> > > > interacting
> > > > with such SQL databases. Consumer systems currently need to either
> > > > convert
> > > > from source "(timestamp, timezone_offset)" to a normalized arrow UTC
> > > > timestamp and throw away the original time zone information, or use
> > > > bespoke
> > > > formats if the time zone needs to be preserved. For example, the ADBC
> > > > implementation for Snowflake currently implements the former [8].
> > > >
>
> > > > Dropping the time zone has correctness implications for some
> > > > applications.
> > > > Consider a global business which performs monthly reporting. Each
> > > > business
> > > > unit is located in a different continent, and processes a number of
> > > > orders
> > > > every month. Each order is placed in an "orders" fact table, where
> > > > "ordered_at" is a "TIMESTAMP WITH TIME ZONE" column. Now, assume
> there is
> > > > an order that was placed at 23:00 (11pm) in California (UTC-8), on
> the
> > > > 31st
> > > > of January. A system that simply casts this timestamp to UTC and
> drops
> > > > the
> > > > time zone will think this order was placed in February, which is an
> > > > incorrect assumption given the business' reporting needs of
> associating
> > > > orders to the business unit where it was processed. There is no way
> to
> > > > fetch the original time zone of the order, and using the local client
> > > > time
> > > > zone does not help either. And so one cannot generate a correct
> report
> > > > without workarounds to preserve the time zone.
> > > >
>
> > > > We think Arrow should have a canonical way representation for this.
> > > >
>
> > > > THE PROPOSAL
> > > > ---
> > > > We propose "TimestampWithOffset" to be a new canonical extension type
> > > > that
> > > > stores timestamps as "struct(timestamp=timestamp[time_unit=any,
> > > > timezone=utc], offset_minutes=int16)", such that timestamps can have
> a
> > > > per-row timezone offset instead of having one global timezone
> attached to
> > > > the entire column.
> > > >
>
> > > > CONSIDERATIONS AND LIMITATIONS
> > > > ---
> > > > 1. Why use a 16-bit integer offset in minutes?
> > > > In ANSI SQL, the time zone information is defined in terms of an
> > > > "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to
> > > > "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest
> > > > granularity with which you can represent a time zone offset, and the
> > > > maximum minutes in the offset is 13*60=780, we believe it makes
> sense for
> > > > the offset to be stored as a 16-bit integer in minutes. Nonetheless,
> > > > 16-bits is large enough to fit a much wider offset
> > > >
>
> > > > It is important to point out that some systems such as MS SQL Server
> do
> > > > implement data types that can represent offsets with sub-minute
> > > > granularity. We believe representing sub-minute granularity is out of
> > > > scope
> > > > for this proposal given that no current or past time zone standards
> have
> > > > ever specified sub-minute offsets [9], and that is what we're trying
> to
> > > > solve for. Furthermore, representing the offset in seconds rather
> than
> > > > minutes would mean the maximum offset is 136060=46800, which is
> greater
> > > > than the maximum positive integer an int16 can represent (32768), and
> > > > thus
> > > > the offset type would need to be wider (int32).
> > > >
>
> > > > 2. This type can still be "lossy"
> > > > Systems like Trino and Oracle DB store the time zone information as
> an
> > > > IANA time zone name, not as an "HOUR TO MINUTE" interval as
> specified by
> > > > the ANSI SQL standard. This means the source system (or the arrow
> > > > compatibility layer, such as ADBC) needs to cast time zone strings
> to the
> > > > offset in minutes. In other words, the integer offset is calculated
> at
> > > > the
> > > > source, not at the consumer.
> > > >
>
> > > > This means that the consumer cannot render the original IANA time
> zone
> > > > string, and needs to use something like "UTC-03:00" instead of
> > > > "America/Sao_Paulo", for example. It's impossible for the consumer to
> > > > lookup the IANA time zone with only the offset, as that is a
> one-to-many
> > > > mapping, and so the type is lossy with respect to the original IANA
> time
> > > > zone.
> > > >
>
> > > > There is an upside to this, which is reducing consumer complexity. It
> > > > only
> > > > needs to add an offset in minutes to the UTC date to get the original
> > > > date
> > > > in its time zone, without any access to the IANA time zone database
> nor
> > > > performing complicated conversions, like reasoning about
> variable-offset
> > > > time zones (e.g daylight savings).
> > > >
>
> > > > 3. JSON representation
> > > > We propose that the de/serialization to/from JSON must use RFC3339
> > > > strings
> > > > [10], without loss of information. RFC3339 is a widely accepted
> format
> > > > across programming languages and databases, and we argue encoding
> > > > "TimestampWithOffset" with it would make JSON integration with
> external
> > > > non-Arrow systems easier. It enables JSON consumers to decode to
> their
> > > > own
> > > > timezone-aware representation of timestamps (like Go's "time.Time" or
> > > > JavaScript's "Date") by leveraging existing RFC3339 de/encoders
> without
> > > > having to implement wrapper boilerplate just for Arrow.
> > > >
>
> > > > RELEVANT LINKS
> > > > ---
> > > > [1] Format specification pull request.
> > > > https://github.com/apache/arrow/pull/48002
> > > > [2] Golang implementation draft.
> > > > https://github.com/apache/arrow-go/pull/558
> > > > [3] Rust implementation draft.
> > > > https://github.com/apache/arrow-rs/pull/8743
> > > > [4] Snowflake's TIMESTAMP_TZ.
> > >
>
> > >
> https://docs.snowflake.com/en/sql-reference/data-types-datetime#timestamp-ltz-timestamp-ntz-timestamp-tz
> > >
>
> > > > [5] Trino's TIMESTAMP WITH TIME ZONE.
> > >
>
> > >
> https://trino.io/docs/current/language/types.html#timestamp-p-with-time-zone
> > >
>
> > > > [6] Oracle's TIMESTAMP WITH TIME ZONE.
> > >
>
> > >
> https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/datetime-data-types-and-time-zone-support.html
> > >
>
> > > > [7] MS SQL Server's DATETIMEOFFSET.
> > >
>
> > >
> https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetimeoffset-transact-sql?view=sql-server-ver17
> > >
>
> > > > [8] ADBC for Snowflake converts to UTC and drops time zone.
> > >
>
> > >
> https://github.com/apache/arrow-adbc/blob/a67ab5a509676feaec8e24dba479d4de8dc083e2/go/adbc/driver/snowflake/record_reader.go#L228
> > >
>
> > > > [9] Current time zones in effect.
> > > > https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
> > > > [10] RFC3339 representation of timezone-aware timestamps.
> > > > https://datatracker.ietf.org/doc/html/rfc3339
> > > >
>
> > > > FORMAT
> > > > ---
> > > > Timestamp With Offset
> > > > =============
> > > > This type represents a timestamp column that stores potentially
> different
> > > > timezone offsets per value. The timestamp is stored in UTC alongside
> the
> > > > original timezone offset in minutes.
> > > >
>
> > > > * Extension name: `arrow.timestamp_with_offset`.
> > > >
>
> > > > * The storage type of the extension is a `Struct` with 2 fields, in
> > > > order:
> > > >
>
> > > > * `timestamp`: a non-nullable `Timestamp(time_unit, "UTC")`, where
> > > > `time_unit` is any Arrow `TimeUnit` (s, ms, us or ns).
> > > >
>
> > > > * `offset_minutes`: a non-nullable signed 16-bit integer (`Int16`)
> > > > representing the offset in minutes from the UTC timezone. Negative
> > > > offsets
> > > > represent time zones west of UTC, while positive offsets represent
> east.
> > > > Offsets range from -779 (-12:59) to +780 (+13:00).
> > > >
>
> > > > * Extension type parameters:
> > > >
>
> > > > * `time_unit`: the time-unit of each of the stored UTC timestamps.
> > > >
>
> > > > * Description of the serialization:
> > > >
>
> > > > Extension metadata is an empty string.
> > > >
>
> > > > When de/serializing to/from JSON, this type must be represented as an
> > > > RFC3339 string, respecting the `TimeUnit` precision and time zone
> > > > offset
> > > > without loss of information. For example `2025-01-01T00:00:00Z`
> > > > represents January 1st 2025 in UTC with second precision, and
> > > > `2025-01-01T00:00:00.000000001-07:00` represents one nanosecond after
> > > > January 1st 2025 in UTC-07.

Reply via email to