Hoeze opened a new issue, #50027:
URL: https://github.com/apache/arrow/issues/50027
### Describe the enhancement requested
Arrow has no canonical way to represent a bounded range (a mathematical
interval with a lower and an upper endpoint), e.g. a numeric range `[0, 10)`, a
date range, or a timestamp period. Today such data is modeled ad hoc with two
separate columns or with system-specific extension types, which hurts
interoperability. A canonical range type would be useful to libraries such as
pandas, Polars/Polars-bio, IRanges/PyRanges, and database connectors, among others.
Note this is distinct from Arrow's existing calendar `Interval` type
(`INTERVAL_MONTHS` / `INTERVAL_DAY_TIME` / `INTERVAL_MONTH_DAY_NANO`), which
represents a duration (a signed amount of time), not a bounded set. Databases
like PostgreSQL make the same distinction: SQL uses `INTERVAL` for durations
and `RANGE` / `PERIOD` for bounded sets. This proposal follows that convention
by naming the type `arrow.range`.
### Proposed design
- Extension name: `arrow.range`.
- Storage type: `Struct<lower: T, upper: T>`. When subtype `T` is nullable,
a null bound represents an unbounded (infinite) endpoint.
- Field names `lower` / `upper` (PostgreSQL convention) are chosen
deliberately for ordering clarity. (Note that pandas uses `left` / `right` for
the field names.)
- The subtype `T` may be any orderable Arrow type (the numeric, temporal
and decimal families, etc.). Nested or non-comparable types are out of scope.
- Metadata: a JSON object `{"closed": "..."}`.
- Parameter `closed`: one of `left`, `right`, `both`, `neither` (pandas
vocabulary; `left` = lower inclusive / upper exclusive, `right` = lower
exclusive / upper inclusive, `both` = both inclusive, `neither` = both
exclusive).
- `closed` is required on the wire so a serialized `arrow.range` is always
unambiguous. Unknown JSON keys are ignored for forward compatibility.
- A range is empty implicitly when `lower > upper`, or when `lower == upper`
with at least one bound exclusive. A range with `lower > upper` is therefore
valid (it denotes the empty set), not an error.
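
The rules in the list above can be sketched in plain Python. This is only an illustration of the proposed semantics, not an Arrow or pandas API: the helper names `parse_closed`, `is_empty`, and `contains` are hypothetical, `None` stands in for a null (unbounded) bound, and treating a null bound as "no constraint" in `contains` is an assumption the proposal does not spell out.

```python
import json

# Mapping from the `closed` vocabulary to (lower_inclusive, upper_inclusive).
CLOSED_FLAGS = {
    "left": (True, False),
    "right": (False, True),
    "both": (True, True),
    "neither": (False, False),
}


def parse_closed(metadata_json):
    """Read the required `closed` key; unknown keys are ignored."""
    obj = json.loads(metadata_json)
    if "closed" not in obj:
        raise ValueError("arrow.range metadata must contain 'closed'")
    closed = obj["closed"]
    if closed not in CLOSED_FLAGS:
        raise ValueError(f"invalid 'closed' value: {closed!r}")
    return closed


def is_empty(lower, upper, closed):
    """Empty when lower > upper, or lower == upper with an exclusive bound."""
    if lower is None or upper is None:
        # Assumption: a range with an unbounded endpoint is never empty.
        return False
    lower_inclusive, upper_inclusive = CLOSED_FLAGS[closed]
    if lower > upper:
        return True
    return lower == upper and not (lower_inclusive and upper_inclusive)


def contains(lower, upper, closed, value):
    """Membership test; a None bound places no constraint on that side."""
    lower_inclusive, upper_inclusive = CLOSED_FLAGS[closed]
    if lower is not None:
        if value < lower or (value == lower and not lower_inclusive):
            return False
    if upper is not None:
        if value > upper or (value == upper and not upper_inclusive):
            return False
    return True
```

For example, under these assumptions `is_empty(5, 3, "both")` is true because `lower > upper`, and `contains(0, 10, "left", 10)` is false because the upper bound of `[0, 10)` is exclusive.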
### Relation to pandas
This mirrors pandas' interval support and deliberately reuses its vocabulary:
- `pandas.Interval` is the scalar form: an immutable bounded interval whose
`closed` parameter takes exactly `left`, `right`, `both`, or `neither`, which is
the vocabulary adopted here for the `closed` metadata.
- `pandas.IntervalIndex` / `pandas.arrays.IntervalArray` (dtype
`interval[T]`) is the columnar form: it stores parallel `left` and `right`
bound arrays with a single `closed` applying to every element, directly
analogous to the proposed `Struct<lower, upper>` storage with an object-level
`closed`.
- Note that per-row closedness is explicitly not a goal of `arrow.range`. If
needed, it could be achieved with a Union type.
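
To make the analogy between the two layouts concrete, here is a minimal sketch in plain Python that takes pandas-style parallel `left` / `right` bound lists with one shared `closed` value and produces struct-like rows plus the JSON metadata described in the proposal. It uses ordinary lists and dicts as stand-ins, not pandas or pyarrow objects, and the function name `to_range_column` is hypothetical.

```python
import json

VALID_CLOSED = {"left", "right", "both", "neither"}


def to_range_column(left, right, closed):
    """Pair up parallel bound lists into {lower, upper} rows.

    `closed` applies to every row, so it lives in the metadata
    rather than in the per-row storage.
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same length")
    if closed not in VALID_CLOSED:
        raise ValueError(f"invalid 'closed' value: {closed!r}")
    # None models a null bound, i.e. an unbounded endpoint.
    storage = [{"lower": lo, "upper": up} for lo, up in zip(left, right)]
    metadata = json.dumps({"closed": closed})
    return storage, metadata


storage, metadata = to_range_column([0, 5, None], [10, 7, 3], "left")
```

Here `storage` holds three rows, the last with an unbounded lower endpoint, and `metadata` carries the single `closed` value for the whole column.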
### Component(s)
Format