pitrou commented on code in PR #48002: URL: https://github.com/apache/arrow/pull/48002#discussion_r2546650112
########## docs/source/format/CanonicalExtensions.rst: ########## @@ -544,6 +544,49 @@ Primitive Type Mappings | UUID extension type | UUID | +----------------------+------------------------+ +.. _timestamp_with_offset_extension: + +Timestamp With Offset +============= +This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. +This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines. + +* Extension name: ``arrow.timestamp_with_offset``. + +* The storage type of the extension is a ``Struct`` with 2 fields, in order: + + * ``timestamp``: a preferably non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns). + + * ``offset_minutes``: a preferably non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00). + +* Extension type parameters: + + * ``time_unit``: the time-unit of each of the stored UTC timestamps. + +* Description of the serialization: + + Extension metadata is an empty string. + +.. note:: + + It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded. + +.. note:: + + It is also *permissible* for ``timestamp`` and ``offset_minutes`` to be nullable, but not recommended. + + If ``timestamp`` is nullable and a value is found to be null, then the whole ``TimestampWithOffset`` value should be interpreted as null. One way of achieving this is to drop ``timestamp``'s validity buffer (V1) and replace the top-level struct's validity buffer (V2) with the result of ``V1 AND V2``. + + If ``offset_minutes`` is nullable and a value is found to be null, then this value should be interpreted as if the offset value were were zero. Review Comment: > And return an error on instantiation? Which instantiation are we talking about exactly? The `ExtensionArray` subclass in Arrow C++? > I think we can allow nulls on the `timestamp` field and on instantiation put it on the struct array also. > > Would you be fine with both points above? We can, but that places a burden on the consumer for the sake of making things easier for the producer. I'm not sure that's a good tradeoff to make. > I think many operations will care only about the timestamp field and having the validity buffer there helps them simply delegate to functions on the `Timestamp` type. At the same time, many function only look at the top-level validity buffer (i.e. the validity of the struct). So during instantiation it would make sense to share the validity buffer if there is one. I admit that can make sense. Let's hope for other opinions on the topic. @paleolimbot @alamb @tustvold -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
