The current proposal has a `closed` parameter with the values {left, right,
both, neither}. How about we extend that to {left, right, both, neither,
left_per_value, right_per_value, both_per_value}?
This is similar to option 3 in Hoeze's list below.
Should per-value closedness information be optional? Could null
closedness have some meaning?

Rok

On Mon, May 25, 2026 at 4:55 PM Hoeze <[email protected]> wrote:

> Yes, you're right, the current proposal would probably not be
> sufficient for continuous PostgreSQL ranges.
>
> Column level boundary flags were intentional as it allows to check
> closedness in the schema instead of during runtime. This is also how
> Pandas' `IntervalArray`/`IntervalIndex` works.
>
> PostgreSQL's built-in discrete ranges (`int4range`, `int8range`,
> `daterange`) canonicalize to left-closed intervals; here my proposal
> would be sufficient. However, continuous ranges (`numrange`,
> `tsrange`, `tstzrange`, ...) cannot be canonicalized. In this case
> my proposal would indeed not be flexible enough.
>
> I could imagine a number of possible solutions to this shortcoming:
>
> * Union type of all four closedness versions:
>   Possible but not very elegant. Would shift the implementation
>   burden towards the applications, that have to support union
>   types.
>
> * Create a separate canonical data type for per-value boundary
>   flags:
>   Storage type `Struct<lower: T, upper: T, lower_inc: bool,
>   upper_inc: bool>`, mirroring PostgreSQL's internal
>   representation. Both types would coexist: `arrow.range` for the
>   uniform case (and for canonicalized discrete PostgreSQL ranges),
>   and e.g. `arrow.range_inc` for continuous (PostgreSQL) ranges.
>
> * Extend `arrow.range` itself with a per-value mode:
>   Keep a single extension type, but allow
>   `{"closed": "per_value"}` in the metadata, in which case the
>   storage struct gains two boolean fields `lower_inc` and
>   `upper_inc`. One extension name, two storage layouts. Simpler
>   from a type-registry standpoint, slightly more conditional logic
>   in implementations.
>
> * Always store per-value flags:
>   Drop the metadata key entirely and always use
>   `Struct<lower: T, upper: T, lower_inc: bool, upper_inc: bool>`.
>   Two extra bytes per row uncompressed, but highly
>   RLE/dictionary-friendly when uniform (which it usually is).
>   Maximally simple to specify, at the cost of some overhead in the
>   common pandas-style case.
>
> I currently lean towards the second option, as it preserves the
> schema-level check for the common case while still giving continuous,
> per-value closedness ranges a lossless path. Fixed-shape tensor vs.
> variable-shape tensor extension types went the same route. The main
> alternative would be option 3, but a single extension name covering
> two storage layouts ties the layout to a JSON metadata field rather
> than to the type name itself, which is easier for downstream tooling
> to get wrong I believe. What do you think?
>
> Best,
> Hoeze
>
>
> On 25.05.26 at 05:27, Curt Hagenlocher wrote:
> > From what I can tell, this would not be sufficiently flexible to store
> > PostgreSQL range columns for which the boundary flags are per-value and
> > not per-column. Is this intentional?
> >
> > On Sun, May 24, 2026 at 4:11 PM Florian R. Hölzlwimmer <
> > [email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Following a suggestion from @rok on GitHub, I'd like to open a
> >> discussion about adding a canonical extension type for bounded ranges
> >> to Arrow.
> >>
> >> Background
> >> ==========
> >>
> >> So far, Arrow has no canonical way to represent a bounded range (a
> >> mathematical interval with a lower and an upper endpoint), e.g. a
> >> numeric range `[0, 10)`, a date range, or a timestamp period. Today
> >> such data is modeled ad hoc with two separate columns or with
> >> system-specific extension types, which hurts interoperability. A
> >> canonical range type will be useful to libraries like Pandas,
> >> Polars/Polars-bio, IRanges/PyRanges, database connectors, ...
> >>
> >> This is distinct from Arrow's existing calendar `Interval` type
> >> (`INTERVAL_MONTHS` / `INTERVAL_DAY_TIME` / `INTERVAL_MONTH_DAY_NANO`),
> >> which represents a duration (a signed amount of time), not a bounded
> >> set. Databases like PostgreSQL make the same distinction: SQL uses
> >> `INTERVAL` for durations and `RANGE` / `PERIOD` for bounded sets. This
> >> proposal follows that convention by naming the type `arrow.range`.
> >>
> >>
> >> Proposed design
> >> ===============
> >>
> >> * Extension name: `arrow.range`.
> >>
> >> * Storage type: `Struct<lower: T, upper: T>`. When subtype `T` is
> >>   nullable, a null bound represents an unbounded (infinite) endpoint.
> >>
> >>     * Field names `lower` / `upper` follow PostgreSQL's convention for
> >>       ordering clarity (Pandas uses `left` / `right`).
> >>     * The subtype `T` may be any orderable Arrow type (the numeric,
> >>       temporal and decimal families, etc.). Nested or non-comparable
> >>       types are out of scope.
> >>
> >> * Metadata: a JSON object `{"closed": "..."}` where `closed` is one of
> >>   `left`, `right`, `both`, `neither` (pandas vocabulary; `left` = lower
> >>   inclusive / upper exclusive, etc.). Required on the wire so a
> >>   serialized `arrow.range` is always unambiguous. Unknown JSON keys are
> >>   ignored for forward compatibility.
> >>
> >> * A range is empty implicitly when `lower > upper`, or when
> >>   `lower == upper` with at least one bound exclusive. A range with
> >>   `lower > upper` is therefore valid (it denotes the empty set), not an
> >>   error.
> >>
> >> This mirrors pandas' interval support closely enough that `arrow.range`
> >> would give `pandas.IntervalArray` / `IntervalIndex` a natural, lossless
> >> Arrow representation for round-tripping.
> >>
> >>
> >> References
> >> ==========
> >>
> >> - Full proposal and rationale:
> >>   https://github.com/apache/arrow/issues/50027
> >> - Draft C++/Python implementation:
> >>   https://github.com/apache/arrow/pull/50028
> >>
> >> I'd appreciate any feedback on the overall direction and on the specific
> >> design choices: field naming (`lower`/`upper` vs. `left`/`right`), the
> >> `closed` parameter, the treatment of unbounded endpoints via
> >> nullability, and the set of supported subtypes.
> >>
> >> Many thanks,
> >> Hoeze
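
To make the relationship between the column-level `closed` values and per-value flags concrete, here is a minimal sketch in plain Python. It only illustrates the rules quoted above (`left` = lower inclusive / upper exclusive, and the emptiness rule); it is not the draft implementation and uses no Arrow API. The names `Range`, `CLOSED_FLAGS`, `from_closed` and `is_empty` are hypothetical, the bounds are plain integers for simplicity, and treating a range with a null (unbounded) endpoint as never empty is my assumption, since the proposal does not spell that case out.

```python
from dataclasses import dataclass
from typing import Optional

# Column-level vocabulary from the proposal, mapped to
# (lower_inclusive, upper_inclusive). Hypothetical helper table.
CLOSED_FLAGS = {
    "left": (True, False),
    "right": (False, True),
    "both": (True, True),
    "neither": (False, False),
}


@dataclass(frozen=True)
class Range:
    """One range value with per-value boundary flags (illustrative only)."""
    lower: Optional[int]  # None stands for an unbounded lower endpoint
    upper: Optional[int]  # None stands for an unbounded upper endpoint
    lower_inc: bool
    upper_inc: bool


def from_closed(lower: Optional[int], upper: Optional[int], closed: str) -> Range:
    """Expand a column-level `closed` value into per-value flags."""
    lower_inc, upper_inc = CLOSED_FLAGS[closed]
    return Range(lower, upper, lower_inc, upper_inc)


def is_empty(r: Range) -> bool:
    """Emptiness rule as stated in the proposal.

    Assumption: a range with an unbounded endpoint is never empty.
    """
    if r.lower is None or r.upper is None:
        return False
    if r.lower > r.upper:
        return True
    # Equal bounds: empty unless both bounds are inclusive.
    return r.lower == r.upper and not (r.lower_inc and r.upper_inc)
```

Under these assumptions every column-level value maps losslessly onto the per-value layout, while the reverse direction only works when all rows happen to share the same flags.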
