Re: [DISCUSS] Add arrow.range canonical extension type for bounded ranges

Antoine Pitrou Tue, 02 Jun 2026 10:02:42 -0700


On 25/05/2026 at 16:54, Hoeze wrote:

Yes, you're right, the current proposal would probably not be
sufficient for continuous PostgreSQL ranges.

Column-level boundary flags were intentional, as they allow closedness
to be checked in the schema instead of at runtime. This is also how
pandas' `IntervalArray`/`IntervalIndex` work.

PostgreSQL's built-in discrete ranges (`int4range`, `int8range`,
`daterange`) canonicalize to left-closed intervals; here my proposal
would be sufficient. However, continuous ranges (`numrange`,
`tsrange`, `tstzrange`, ...) cannot be canonicalized. In this case
my proposal would indeed not be flexible enough.
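
To illustrate why the discrete case is unproblematic, here is a minimal
sketch of canonicalizing a bounded integer range to left-closed,
right-open form. The function name is hypothetical, and unbounded and
empty ranges are deliberately ignored. No equivalent rewrite exists for
continuous element types, since there is no "next" value to step to.

```python
from typing import Tuple


def canonicalize_discrete(
    lower: int, upper: int, lower_inc: bool, upper_inc: bool
) -> Tuple[int, int]:
    """Return (lower, upper) of the equivalent [lower, upper) integer range.

    Hypothetical helper for illustration only: it handles bounded,
    non-empty integer ranges and nothing else.
    """
    if not lower_inc:
        # (a, ...  contains the same integers as  [a + 1, ...
        lower += 1
    if upper_inc:
        # ..., b]  contains the same integers as  ..., b + 1)
        upper += 1
    return lower, upper
```

With this, all four closedness variants of an integer range collapse to
one representation, so a single schema-level flag is enough.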

I could imagine a number of possible solutions to this shortcoming:

    * Union type of all four closedness versions:
      Possible but not very elegant. It would shift the implementation
      burden onto applications, which would have to support union
      types.

    * Create a separate canonical data type for per-value boundary
      flags:
      Storage type `Struct<lower: T, upper: T, lower_inc: bool,
      upper_inc: bool>`, mirroring PostgreSQL's internal
      representation. Both types would coexist: `arrow.range` for the
      uniform case (and for canonicalized discrete PostgreSQL ranges),
      and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.

    * Extend `arrow.range` itself with a per-value mode:
      Keep a single extension type, but allow
      `{"closed": "per_value"}` in the metadata, in which case the
      storage struct gains two boolean fields `lower_inc` and
      `upper_inc`. One extension name, two storage layouts. Simpler
      from a type-registry standpoint, slightly more conditional logic
      in implementations.

    * Always store per-value flags:
      Drop the metadata key entirely and always use
      `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`.
      Two extra bytes per row uncompressed, but highly RLE/dictionary-
      friendly when uniform (which it usually is). Maximally simple to
      specify, at the cost of some overhead in the common pandas-style
      case.
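
As a rough sketch of how a producer might choose between the two
coexisting types under the second option: the `Range` tuple, the
function name, the pandas-style values for `closed`, the default for an
empty column, and the empty metadata for `arrow.range_inc` are all
assumptions for illustration, not part of the proposal text.

```python
import json
from typing import List, NamedTuple, Tuple


class Range(NamedTuple):
    lower: float
    upper: float
    lower_inc: bool
    upper_inc: bool


# Assumed pandas-style names for the four closedness variants.
_CLOSED_NAMES = {
    (True, False): "left",
    (False, True): "right",
    (True, True): "both",
    (False, False): "neither",
}


def choose_extension_type(ranges: List[Range]) -> Tuple[str, str]:
    """Return (extension_name, metadata_json) for a column of ranges."""
    flags = {(r.lower_inc, r.upper_inc) for r in ranges}
    if len(flags) <= 1:
        # Uniform closedness: the flag can live in the schema metadata.
        # An empty column defaults to "left" (an arbitrary assumption).
        closed = _CLOSED_NAMES[next(iter(flags))] if flags else "left"
        return "arrow.range", json.dumps({"closed": closed})
    # Mixed closedness: the flags must be stored per value.
    return "arrow.range_inc", json.dumps({})
```

A column read from `numrange` with mixed bounds would then map to
`arrow.range_inc`, while a pandas-style interval column would stay on
`arrow.range`.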

I currently lean towards the second option, as it preserves the
schema-level check for the common case while still giving continuous
ranges with per-value closedness a lossless path. The fixed-shape
tensor vs. variable-shape tensor extension types took the same route.
The main alternative would be option 3, but a single extension name
covering two storage layouts ties the layout to a JSON metadata field
rather than to the type name itself, which I believe is easier for
downstream tooling to get wrong. What do you think?


I agree that option 2 sounds best; the tensor analogy is spot-on.

Regards

Antoine.
