Re: [DISCUSS] Add arrow.range canonical extension type for bounded ranges

Felipe Oliveira Carvalho Mon, 08 Jun 2026 10:26:04 -0700

I feel like you should have different types depending on whether a bound is
open or not.


[.., ..) arrow.half_open_range
(.., ..) arrow.open_range
(.., ..] arrow.lower_half_open_range
[.., ..] arrow.closed_range

Routing compute functions based on metadata would be hellish and would
force compute function writers to care about all possible values of the
metadata when a new function on ranges is introduced. With the different
types, you could have a canonicalization layer that allocates a new array
before a function learns to handle all possible range types.

Example: a contains() implementation that operates directly only on
arrow.half_open_range arrays and, for any other input, first calls a
utility that canonicalizes every range type to arrow.half_open_range.
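
A minimal sketch of that canonicalize-then-compute idea, in plain Python
rather than against any real Arrow API. Everything here is an assumption
for illustration: the `RangeArray` container, the `to_half_open` and
`contains` helpers, and the four type-name constants are hypothetical.
It also assumes integer bounds, since only discrete ranges can be
rewritten as half-open by shifting a bound by one.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical extension names; the interval notation is in the comments.
HALF_OPEN = "arrow.half_open_range"              # [lower, upper)
OPEN = "arrow.open_range"                        # (lower, upper)
LOWER_HALF_OPEN = "arrow.lower_half_open_range"  # (lower, upper]
CLOSED = "arrow.closed_range"                    # [lower, upper]


@dataclass
class RangeArray:
    """Toy stand-in for a columnar range array with integer bounds."""
    type_name: str
    lower: List[int]
    upper: List[int]


def to_half_open(arr: RangeArray) -> RangeArray:
    """Canonicalize any of the four types to [lower, upper).

    Allocates new bound lists unless the input is already half-open.
    """
    if arr.type_name == HALF_OPEN:
        return arr
    lower_is_open = arr.type_name in (OPEN, LOWER_HALF_OPEN)
    upper_is_closed = arr.type_name in (LOWER_HALF_OPEN, CLOSED)
    # An open integer lower bound (lo, ...) equals the closed bound [lo + 1, ...).
    lower = [lo + 1 for lo in arr.lower] if lower_is_open else list(arr.lower)
    # A closed integer upper bound ..., hi] equals the open bound ..., hi + 1).
    upper = [hi + 1 for hi in arr.upper] if upper_is_closed else list(arr.upper)
    return RangeArray(HALF_OPEN, lower, upper)


def contains(arr: RangeArray, value: int) -> List[bool]:
    """The kernel itself only knows how to handle half-open ranges."""
    arr = to_half_open(arr)
    return [lo <= value < hi for lo, hi in zip(arr.lower, arr.upper)]
```

A new function on ranges then only needs the half-open kernel on day one,
and can add direct kernels for the other types later to avoid the extra
allocation.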

Making the openness per value will complicate columnar compute functions,
even though it simplifies transport from databases like PostgreSQL. You
would not need to canonicalize the output from a PostgreSQL query into one
of the range types in the ADBC driver, for instance. At the same time,
expecting application consumers of an ADBC driver to check the open/closed
flags per value is a big ask.
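
To make the per-value cost concrete, here is a rough sketch of what a
consumer has to do when openness is stored per row. The `PerValueRange`
record and `contains_per_value` helper are hypothetical names; the four
fields follow the `lower`/`upper`/`lower_inc`/`upper_inc` struct layout
quoted further down in this thread.

```python
from typing import List, NamedTuple


class PerValueRange(NamedTuple):
    """One row of a hypothetical per-value range column."""
    lower: float
    upper: float
    lower_inc: bool  # True if the lower bound is inclusive
    upper_inc: bool  # True if the upper bound is inclusive


def contains_per_value(ranges: List[PerValueRange], value: float) -> List[bool]:
    """Both flags have to be consulted for every single row."""
    out = []
    for r in ranges:
        above = value >= r.lower if r.lower_inc else value > r.lower
        below = value <= r.upper if r.upper_inc else value < r.upper
        out.append(above and below)
    return out
```

The comparison operators cannot be chosen once per column here, which is
the branching that a fixed-openness type lets a kernel avoid.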

--
Felipe

On Sun, Jun 7, 2026 at 11:23 AM Hoeze <[email protected]> wrote:

> Dear all,
>
> What is your opinion on naming the range types with a fixed closedness and
> a per-value closedness?
> Currently, they are called `RangeType` and `RangeIncType`, which Rok
> considers poor naming (and I agree).
>
> Some ideas:
>
>  1. RangeType | VarRangeType
>  2. RangeType | PerValueRangeType
>  3. RangeType | GranularRangeType
>  4. FixedClosednessRangeType | VariableClosednessRangeType
>
> My personal favorite is (4), as it mirrors the tensor type naming and
> points out the difference.
> However, it sounds quite long and clunky.
>
> What do you think?
>
> Best,
> Hoeze
>
> On 04/06/2026 22:30, Hoeze wrote:
> > Thank you for your feedback, Antoine.
> >
> > I updated my draft PR and added the 'range_inc' type:
> > https://github.com/apache/arrow/pull/50028/
> >
> > Please let me know if you have any further suggestions :)
> >
> > Best,
> > Hoeze
> >
> > On 02.06.26 at 19:00, Antoine Pitrou wrote:
> >>
> >> On 25/05/2026 at 16:54, Hoeze wrote:
> >>> Yes, you're right, the current proposal would probably not be
> >>> sufficient for continuous PostgreSQL ranges.
> >>>
> >>> Column-level boundary flags were intentional, as they allow checking
> >>> closedness in the schema instead of at runtime. This is also how
> >>> pandas' `IntervalArray`/`IntervalIndex` work.
> >>>
> >>> PostgreSQL's built-in discrete ranges (`int4range`, `int8range`,
> >>> `daterange`) canonicalize to left-closed intervals; here my proposal
> >>> would be sufficient. However, continuous ranges (`numrange`,
> >>> `tsrange`, `tstzrange`, ...) cannot be canonicalized. In this case
> >>> my proposal would indeed not be flexible enough.
> >>>
> >>> I could imagine a number of possible solutions to this shortcoming:
> >>>
> >>>     * Union type of all four closedness versions:
> >>>       Possible but not very elegant. Would shift the implementation
> >>>       burden towards the applications, which would have to support
> >>>       union types.
> >>>
> >>>     * Create a separate canonical data type for per-value boundary
> >>>       flags:
> >>>       Storage type `Struct<lower: T, upper: T, lower_inc: bool,
> >>>       upper_inc: bool>`, mirroring PostgreSQL's internal
> >>>       representation. Both types would coexist: `arrow.range` for the
> >>>       uniform case (and for canonicalized discrete PostgreSQL ranges),
> >>>       and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.
> >>>
> >>>     * Extend `arrow.range` itself with a per-value mode:
> >>>       Keep a single extension type, but allow
> >>>       `{"closed": "per_value"}` in the metadata, in which case the
> >>>       storage struct gains two boolean fields `lower_inc` and
> >>>       `upper_inc`. One extension name, two storage layouts. Simpler
> >>>       from a type-registry standpoint, slightly more conditional logic
> >>>       in implementations.
> >>>
> >>>     * Always store per-value flags:
> >>>       Drop the metadata key entirely and always use
> >>>       `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`.
> >>>       Two extra bytes per row uncompressed, but highly RLE/dictionary-
> >>>       friendly when uniform (which it usually is). Maximally simple to
> >>>       specify, at the cost of some overhead in the common pandas-style
> >>>       case.
> >>>
> >>> I currently lean towards the second option, as it preserves the
> >>> schema-level check for the common case while still giving continuous,
> >>> per-value closedness ranges a lossless path. Fixed-shape tensor vs.
> >>> variable-shape tensor extension types went the same route. The main
> >>> alternative would be option 3, but a single extension name covering
> >>> two storage layouts ties the layout to a JSON metadata field rather
> >>> than to the type name itself, which I believe is easier for
> >>> downstream tooling to get wrong. What do you think?
> >>
> >> I agree that option 2 sounds best, the tensor analogy is spot-on.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
