Re: [DISCUSS] Add arrow.range canonical extension type for bounded ranges

Rok Mihevc Mon, 25 May 2026 09:10:23 -0700

The current proposal has a "closed" parameter with the values {left, right,
both, neither}. How about extending that to {left, right, both, neither,
left_per_value, right_per_value, both_per_value}?
This is similar to option 3 below.
Should per-value closedness information be optional? Could a null
closedness value carry some meaning?


Rok

On Mon, May 25, 2026 at 4:55 PM Hoeze <[email protected]> wrote:

> Yes, you're right, the current proposal would probably not be
> sufficient for continuous PostgreSQL ranges.
>
> Column-level boundary flags were intentional, as they allow checking
> closedness in the schema instead of at runtime. This is also how
> pandas' `IntervalArray`/`IntervalIndex` work.
>
> PostgreSQL's built-in discrete ranges (`int4range`, `int8range`,
> `daterange`) canonicalize to left-closed intervals; here my proposal
> would be sufficient. However, continuous ranges (`numrange`,
> `tsrange`, `tstzrange`, ...) cannot be canonicalized. In this case
> my proposal would indeed not be flexible enough.
>
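
For readers less familiar with the discrete case, the canonicalization
can be sketched roughly as follows. This is plain Python, not
PostgreSQL's actual code; `canonicalize_int_range` and its argument
names are made up for illustration, and `None` stands for an unbounded
endpoint.

```python
def canonicalize_int_range(lower, upper, lower_inc, upper_inc):
    """Rewrite an integer range to lower-inclusive / upper-exclusive form.

    Hypothetical helper for illustration only. `None` marks an unbounded
    endpoint, which is left untouched.
    """
    # An exclusive lower bound n admits the same integers as inclusive n + 1.
    if lower is not None and not lower_inc:
        lower += 1
    # An inclusive upper bound n admits the same integers as exclusive n + 1.
    if upper is not None and upper_inc:
        upper += 1
    return lower, upper


# `[1, 5]` and `(0, 6)` describe the same set of integers.
print(canonicalize_int_range(1, 5, True, True))
print(canonicalize_int_range(0, 6, False, False))
```

Both calls give the same pair, i.e. `[1, 5]` and `(0, 6)` both become
`[1, 6)`. A continuous subtype such as `numrange` has no successor value
to shift a bound to, so no equivalent rewrite exists there.
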
> I could imagine a number of possible solutions to this shortcoming:
>
>    * Union type of all four closedness versions:
>      Possible but not very elegant. It would shift the implementation
>      burden onto applications, which would have to support union
>      types.
>
>    * Create a separate canonical data type for per-value boundary
>      flags:
>      Storage type `Struct<lower: T, upper: T, lower_inc: bool,
>      upper_inc: bool>`, mirroring PostgreSQL's internal
>      representation. Both types would coexist: `arrow.range` for the
>      uniform case (and for canonicalized discrete PostgreSQL ranges),
>      and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.
>
>    * Extend `arrow.range` itself with a per-value mode:
>      Keep a single extension type, but allow
>      `{"closed": "per_value"}` in the metadata, in which case the
>      storage struct gains two boolean fields `lower_inc` and
>      `upper_inc`. One extension name, two storage layouts. Simpler
>      from a type-registry standpoint, slightly more conditional logic
>      in implementations.
>
>    * Always store per-value flags:
>      Drop the metadata key entirely and always use
>      `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`.
>      Two extra bits per row uncompressed (Arrow booleans are
>      bit-packed), and highly RLE/dictionary-friendly when uniform
>      (which it usually is). Maximally simple to specify, at the cost
>      of some overhead in the common pandas-style case.
>
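
To compare the options, it may help to spell out how the column-level
`closed` value relates to per-value flags. Below is a rough sketch in
plain Python (not an Arrow API; every name in it is hypothetical),
assuming the mapping from the original proposal where `left` means
lower inclusive / upper exclusive.

```python
from itertools import groupby

# pandas-style `closed` value -> (lower_inc, upper_inc)
CLOSED_TO_FLAGS = {
    "left": (True, False),
    "right": (False, True),
    "both": (True, True),
    "neither": (False, False),
}
FLAGS_TO_CLOSED = {flags: name for name, flags in CLOSED_TO_FLAGS.items()}


def uniform_closed(lower_inc, upper_inc):
    """Return the column-level `closed` value if all rows agree, else None.

    An empty column also yields None, since there is nothing to agree on.
    """
    pairs = set(zip(lower_inc, upper_inc))
    if len(pairs) == 1:
        return FLAGS_TO_CLOSED[pairs.pop()]
    return None


def run_lengths(values):
    """Run-length encode a sequence as (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# A uniform flag column collapses to one run and one schema-level value.
print(uniform_closed([True] * 4, [False] * 4))
print(run_lengths([True] * 4))
```

If every row agrees, `uniform_closed` recovers the schema-level value,
so converting between the uniform and per-value layouts would be
lossless in that case. `run_lengths` only illustrates why a uniform
flag column compresses to a single run.
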
> I currently lean towards the second option, as it preserves the
> schema-level check for the common case while still giving continuous
> ranges with per-value closedness a lossless path. The fixed-shape and
> variable-shape tensor extension types took the same route. The main
> alternative would be option 3, but a single extension name covering
> two storage layouts ties the layout to a JSON metadata field rather
> than to the type name itself, which I believe is easier for
> downstream tooling to get wrong. What do you think?
>
> Best,
> Hoeze
>
>
> On 25.05.26 at 05:27, Curt Hagenlocher wrote:
> > From what I can tell, this would not be sufficiently flexible to
> > store PostgreSQL range columns for which the boundary flags are
> > per-value and not per-column. Is this intentional?
> >
> > On Sun, May 24, 2026 at 4:11 PM Florian R. Hölzlwimmer <
> > [email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Following a suggestion from @rok on GitHub, I'd like to open a
> >> discussion about adding a canonical extension type for bounded
> >> ranges to Arrow.
> >>
> >> Background
> >> ==========
> >>
> >> So far, Arrow has no canonical way to represent a bounded range (a
> >> mathematical interval with a lower and an upper endpoint), e.g. a
> >> numeric range `[0, 10)`, a date range, or a timestamp period. Today
> >> such data is modeled ad hoc with two separate columns or with
> >> system-specific extension types, which hurts interoperability. A
> >> canonical range type will be useful to libraries like Pandas,
> >> Polars/Polars-bio, IRanges/PyRanges, database connectors, ...
> >>
> >> This is distinct from Arrow's existing calendar `Interval` type
> >> (`INTERVAL_MONTHS` / `INTERVAL_DAY_TIME` / `INTERVAL_MONTH_DAY_NANO`),
> >> which represents a duration (a signed amount of time), not a bounded
> >> set. Databases like PostgreSQL make the same distinction: SQL uses
> >> `INTERVAL` for durations and `RANGE` / `PERIOD` for bounded sets.
> >> This proposal follows that convention by naming the type
> >> `arrow.range`.
> >>
> >>
> >> Proposed design
> >> ===============
> >>
> >>     * Extension name: `arrow.range`.
> >>
> >>     * Storage type: `Struct<lower: T, upper: T>`. When subtype `T`
> >>       is nullable, a null bound represents an unbounded (infinite)
> >>       endpoint.
> >>
> >>         * Field names `lower` / `upper` follow PostgreSQL's
> >>           convention for ordering clarity (Pandas uses `left` /
> >>           `right`).
> >>         * The subtype `T` may be any orderable Arrow type (the
> >>           numeric, temporal and decimal families, etc.). Nested or
> >>           non-comparable types are out of scope.
> >>
> >>     * Metadata: a JSON object `{"closed": "..."}` where `closed` is
> >>       one of `left`, `right`, `both`, `neither` (pandas vocabulary;
> >>       `left` = lower inclusive / upper exclusive, etc.). Required on
> >>       the wire so a serialized `arrow.range` is always unambiguous.
> >>       Unknown JSON keys are ignored for forward compatibility.
> >>
> >>     * A range is empty implicitly when `lower > upper`, or when
> >>       `lower == upper` with at least one bound exclusive. A range
> >>       with `lower > upper` is therefore valid (it denotes the empty
> >>       set), not an error.
> >>
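
The metadata and emptiness rules above can be sketched in a few lines.
This is only my reading of the proposal in plain Python, not the draft
implementation; `parse_closed` and `is_empty` are hypothetical names,
and `None` stands in for a null (unbounded) bound.

```python
import json

# `closed` value -> (lower inclusive, upper inclusive)
CLOSED_FLAGS = {
    "left": (True, False),
    "right": (False, True),
    "both": (True, True),
    "neither": (False, False),
}


def parse_closed(serialized):
    """Extract `closed` from serialized metadata; unknown keys are ignored."""
    meta = json.loads(serialized)
    closed = meta.get("closed") if isinstance(meta, dict) else None
    if not isinstance(closed, str) or closed not in CLOSED_FLAGS:
        raise ValueError(f"missing or invalid 'closed' value: {closed!r}")
    return closed


def is_empty(lower, upper, closed):
    """Apply the implicit emptiness rule to a single range."""
    # Assumption: a range with an unbounded endpoint is never empty.
    if lower is None or upper is None:
        return False
    if lower > upper:
        return True
    # Equal bounds hold exactly one point, and only if both are inclusive.
    return lower == upper and CLOSED_FLAGS[closed] != (True, True)


closed = parse_closed('{"closed": "left", "some_future_key": 1}')
print(closed, is_empty(3, 3, closed), is_empty(0, 10, closed))
```

Under these assumptions `[3, 3)` is empty while `[3, 3]` is not, and a
range with `lower > upper` is reported as empty instead of raising.
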
> >> This mirrors pandas' interval support closely enough that `arrow.range`
> >> would give `pandas.IntervalArray` / `IntervalIndex` a natural, lossless
> >> Arrow representation for round-tripping.
> >>
> >>
> >> References
> >> ==========
> >>
> >>     - Full proposal and rationale:
> >>       https://github.com/apache/arrow/issues/50027
> >>     - Draft C++/Python implementation:
> >>       https://github.com/apache/arrow/pull/50028
> >>
> >> I'd appreciate any feedback on the overall direction and on the
> >> specific design choices: field naming (`lower`/`upper` vs.
> >> `left`/`right`), the `closed` parameter, the treatment of unbounded
> >> endpoints via nullability, and the set of supported subtypes.
> >>
> >> Many thanks,
> >> Hoeze
> >>
>
