On 25/05/2026 at 16:54, Hoeze wrote:
Yes, you're right, the current proposal would probably not be
sufficient for continuous PostgreSQL ranges.
Column-level boundary flags were intentional, as they allow closedness
to be checked in the schema rather than at runtime. This is also how
pandas' `IntervalArray`/`IntervalIndex` work.
PostgreSQL's built-in discrete ranges (`int4range`, `int8range`,
`daterange`) canonicalize to left-closed, right-open (`[)`) intervals;
here my proposal would be sufficient. However, continuous ranges
(`numrange`, `tsrange`, `tstzrange`, ...) have no canonical form, so
each value keeps its own bound flags. In this case my proposal would
indeed not be flexible enough.
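
To make the discrete case concrete, here is a minimal sketch of that canonicalization for integer ranges. The function name `canonicalize_discrete` is hypothetical, `None` is assumed to stand for an unbounded side, and empty ranges are not handled; it only illustrates why a step of one unit lets every discrete range be written in `[)` form.

```python
from typing import Optional, Tuple


def canonicalize_discrete(
    lower: Optional[int],
    upper: Optional[int],
    lower_inc: bool,
    upper_inc: bool,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (lower, upper) such that the range reads as [lower, upper).

    Hypothetical helper for illustration only. None means "unbounded".
    """
    if lower is not None and not lower_inc:
        lower += 1  # (a, ...  becomes  [a + 1, ...
    if upper is not None and upper_inc:
        upper += 1  # ..., b]  becomes  ..., b + 1)
    return lower, upper


# (1, 5] contains 2, 3, 4, 5, which is the same set as [2, 6)
print(canonicalize_discrete(1, 5, lower_inc=False, upper_inc=True))
```

No such step exists for `numrange` or `tsrange` values, which is why the flags have to travel with each value there.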
I could imagine a number of possible solutions to this shortcoming:
* Union type of all four closedness versions:
Possible, but not very elegant. It would shift the implementation
burden onto applications, which would have to support union
types.
* Create a separate canonical data type for per-value boundary
flags:
Storage type `Struct<lower: T, upper: T, lower_inc: bool,
upper_inc: bool>`, mirroring PostgreSQL's internal
representation. Both types would coexist: `arrow.range` for the
uniform case (and for canonicalized discrete PostgreSQL ranges),
and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.
* Extend `arrow.range` itself with a per-value mode:
Keep a single extension type, but allow
`{"closed": "per_value"}` in the metadata, in which case the
storage struct gains two boolean fields `lower_inc` and
`upper_inc`. One extension name, two storage layouts. Simpler
from a type-registry standpoint, slightly more conditional logic
in implementations.
* Always store per-value flags:
Drop the metadata key entirely and always use
`Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`.
Two extra bits per row uncompressed (Arrow booleans are bit-packed),
and highly RLE/dictionary-friendly when uniform (which it usually
is). Maximally simple to specify, at the cost of some overhead in
the common pandas-style case.
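
As a rough illustration of how a producer might pick between the two coexisting types in the second option, here is a small sketch in plain Python. Everything beyond the names in the list above is an assumption: the helper `choose_layout` is hypothetical, the `closed` values (`left`, `right`, `both`, `neither`) are borrowed from pandas rather than taken from the proposal, and plain dicts stand in for Arrow struct storage.

```python
import json

# Column-level vocabulary borrowed from pandas; assumed, not part of the proposal.
CLOSED_NAMES = {
    (True, False): "left",
    (False, True): "right",
    (True, True): "both",
    (False, False): "neither",
}


def choose_layout(rows):
    """Pick an extension type for rows of (lower, upper, lower_inc, upper_inc).

    Hypothetical helper: uniform flags go to `arrow.range` with a
    schema-level `closed` key, mixed flags go to `arrow.range_inc`
    with the flags stored per value.
    """
    flags = {(lower_inc, upper_inc) for _, _, lower_inc, upper_inc in rows}
    if len(flags) == 1:
        closed = CLOSED_NAMES[next(iter(flags))]
        return {
            "extension_name": "arrow.range",
            "metadata": json.dumps({"closed": closed}),
            "storage": [{"lower": lo, "upper": hi} for lo, hi, _, _ in rows],
        }
    return {
        "extension_name": "arrow.range_inc",
        "metadata": "",
        "storage": [
            {"lower": lo, "upper": hi, "lower_inc": li, "upper_inc": ui}
            for lo, hi, li, ui in rows
        ],
    }


uniform = choose_layout([(1, 5, True, False), (2, 9, True, False)])
mixed = choose_layout([(1.0, 5.0, True, False), (2.0, 9.0, True, True)])
print(uniform["extension_name"], uniform["metadata"])
print(mixed["extension_name"])
```

In this sketch a consumer can tell the storage layout from the extension name alone, without parsing the metadata.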
I currently lean towards the second option, as it preserves the
schema-level check for the common case while still giving continuous
ranges with per-value closedness a lossless path. The fixed-shape and
variable-shape tensor extension types went the same route. The main
alternative would be option 3, but a single extension name covering
two storage layouts ties the layout to a JSON metadata field rather
than to the type name itself, which I believe is easier for downstream
tooling to get wrong. What do you think?
I agree that option 2 sounds best; the tensor analogy is spot-on.
Regards
Antoine.