I feel like you should have different types depending on whether a bound is open or not.

[.., ..)  arrow.half_open_range
(.., ..)  arrow.open_range
(.., ..]  arrow.lower_half_open_range
[.., ..]  arrow.closed_range

Routing compute functions based on metadata would be hellish and would
force compute function writers to handle every possible metadata value
whenever a new function on ranges is introduced. With different types,
you could have a canonicalization layer that allocates a new array until
a function learns to handle all possible range types.

Example: a contains() implementation that directly handles only
arrow.half_open_range arrays and, for any other input, calls a utility
that can canonicalize every range type to arrow.half_open_range.

Making the openness per value will complicate columnar compute functions,
even though it simplifies transport from databases like PostgreSQL. You
would not need to canonicalize the output from a PostgreSQL query into one
of the range types in the ADBC driver, for instance. At the same time,
expecting application consumers of an ADBC driver to check the open/closed
flags per value is a big ask.

--
Felipe

On Sun, Jun 7, 2026 at 11:23 AM Hoeze <[email protected]> wrote:

> Dear all,
>
> What is your opinion on naming the range types with a fixed closedness and
> a per-value closedness?
> Currently, they are called `RangeType` and `RangeIncType`, which Rok
> considers poor naming (and I agree).
>
> Some ideas:
>
> 1. RangeType | VarRangeType
> 2. RangeType | PerValueRangeType
> 3. RangeType | GranularRangeType
> 4. FixedClosednessRangeType | VariableClosednessRangeType
>
> My personal favorite is (4), as it mirrors the tensor type naming and
> points out the difference.
> However, it sounds quite long and clunky.
>
> What do you think?
>
> Best,
> Hoeze
>
> On 04/06/2026 22:30, Hoeze wrote:
> > Thank you for your feedback, Antoine.
> >
> > I updated my draft PR and added the 'range_inc' type:
> > https://github.com/apache/arrow/pull/50028/
> >
> > Please let me know if you have any further suggestions :)
> >
> > Best,
> > Hoeze
> >
> > On 02.06.26 at 19:00, Antoine Pitrou wrote:
> >>
> >> On 25/05/2026 at 16:54, Hoeze wrote:
> >>> Yes, you're right, the current proposal would probably not be
> >>> sufficient for continuous PostgreSQL ranges.
> >>>
> >>> Column-level boundary flags were intentional, as they allow checking
> >>> closedness in the schema instead of at runtime. This is also how
> >>> Pandas' `IntervalArray`/`IntervalIndex` works.
> >>>
> >>> PostgreSQL's built-in discrete ranges (`int4range`, `int8range`,
> >>> `daterange`) canonicalize to left-closed intervals; here my proposal
> >>> would be sufficient. However, continuous ranges (`numrange`,
> >>> `tsrange`, `tstzrange`, ...) cannot be canonicalized. In this case
> >>> my proposal would indeed not be flexible enough.
> >>>
> >>> I could imagine a number of possible solutions to this shortcoming:
> >>>
> >>> * Union type of all four closedness versions:
> >>>   Possible but not very elegant. Would shift the implementation
> >>>   burden onto applications, which would have to support union
> >>>   types.
> >>>
> >>> * Create a separate canonical data type for per-value boundary
> >>>   flags:
> >>>   Storage type `Struct<lower: T, upper: T, lower_inc: bool,
> >>>   upper_inc: bool>`, mirroring PostgreSQL's internal
> >>>   representation. Both types would coexist: `arrow.range` for the
> >>>   uniform case (and for canonicalized discrete PostgreSQL ranges),
> >>>   and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.
> >>>
> >>> * Extend `arrow.range` itself with a per-value mode:
> >>>   Keep a single extension type, but allow
> >>>   `{"closed": "per_value"}` in the metadata, in which case the
> >>>   storage struct gains two boolean fields `lower_inc` and
> >>>   `upper_inc`. One extension name, two storage layouts.
> >>>   Simpler from a type-registry standpoint, slightly more
> >>>   conditional logic in implementations.
> >>>
> >>> * Always store per-value flags:
> >>>   Drop the metadata key entirely and always use
> >>>   `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`.
> >>>   Two extra bytes per row uncompressed, but highly
> >>>   RLE/dictionary-friendly when uniform (which it usually is).
> >>>   Maximally simple to specify, at the cost of some overhead in
> >>>   the common pandas-style case.
> >>>
> >>> I currently lean towards the second option, as it preserves the
> >>> schema-level check for the common case while still giving continuous,
> >>> per-value closedness ranges a lossless path. Fixed-shape tensor vs.
> >>> variable-shape tensor extension types went the same route. The main
> >>> alternative would be option 3, but a single extension name covering
> >>> two storage layouts ties the layout to a JSON metadata field rather
> >>> than to the type name itself, which is easier for downstream tooling
> >>> to get wrong I believe. What do you think?
> >>
> >> I agree that option 2 sounds best, the tensor analogy is spot-on.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
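
PS: A minimal sketch of the canonicalization idea from the top of this
message, assuming integer (discrete) bounds so that every bounds kind can
be rewritten as [lower, upper). `IntRangeArray`, `to_half_open` and
`contains` are hypothetical names for illustration only; they are not part
of Arrow or of the draft PR, and nulls are ignored. Continuous ranges
could not be shifted like this, as noted in the quoted discussion.

```python
from dataclasses import dataclass
from typing import List

VALID_BOUNDS = ("[)", "()", "(]", "[]")


# Hypothetical stand-in for a range array with column-level bounds.
# A real implementation would wrap an Arrow struct array instead.
@dataclass
class IntRangeArray:
    bounds: str          # one of VALID_BOUNDS, fixed for the whole column
    lower: List[int]
    upper: List[int]


def to_half_open(arr: IntRangeArray) -> IntRangeArray:
    """Return an equivalent "[)" array, allocating new columns if needed."""
    if arr.bounds not in VALID_BOUNDS:
        raise ValueError(f"unknown bounds: {arr.bounds!r}")
    if arr.bounds == "[)":
        return arr  # already canonical, nothing to allocate
    # For integers, an open lower bound a means "starts at a + 1" and a
    # closed upper bound b means "ends before b + 1".
    lower_shift = 1 if arr.bounds[0] == "(" else 0
    upper_shift = 1 if arr.bounds[1] == "]" else 0
    return IntRangeArray(
        "[)",
        [lo + lower_shift for lo in arr.lower],
        [hi + upper_shift for hi in arr.upper],
    )


def contains(arr: IntRangeArray, value: int) -> List[bool]:
    """The kernel itself only understands "[)"; other inputs are canonicalized."""
    canonical = to_half_open(arr)
    return [lo <= value < hi for lo, hi in zip(canonical.lower, canonical.upper)]
```

A function written this way works for all four bounds kinds from day one
and can later grow direct code paths for the non-canonical kinds to avoid
the extra allocation.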
