Re: [Discussion] Follow up discussion for Interval Type

Felipe Oliveira Carvalho Thu, 03 Jul 2025 08:02:21 -0700

Hi,

I recently took the time to understand MonthDayNano better and it's not as
crazy as my first impression of it.

I think we could solve many of the problems you're trying to solve by
embracing it. Replies inline:

On Tue, Jul 1, 2025 at 2:56 AM yun zou <yunzou.colost...@gmail.com> wrote:

> Hi Team,
>
> Resending the previous email about the Interval Type discussion from the
> last Parquet community sync, with some formatting adjustments.
>
> The primary focus of the conversation is the proposed INTERVAL type's
> *compatibility
> with Apache Arrow*. Several key issues have been raised:
>
> 1. *Is there a more descriptive name for DayTimeInterval?*
> While the name DayTimeInterval closely follows the SQL standard and matches
> naming conventions used by most engines, some suggest that a name
> emphasizing precision—such as *DayNanoInterval*—might provide better
> clarity.
>
> 2. *Should we consider representing DayTimeInterval using Arrow's
> MonthDayNano?*
> Mapping DayTimeInterval to Arrow's MonthDayNano type is problematic due to
> semantic differences:
>
>    -  MonthDayNano combines both calendar-based and duration-based
>    components, whereas DayTimeInterval represents a pure duration.
>

Month and Day components actually allow the SQL engine to delay caring
about calendars when parsing interval literals:

    INTERVAL '2-11 28' YEAR TO DAY

month = 2 * 12 + 11. (there are no leap months, so converting from years to
month is easy)
day = 28
nanos = 0

If you get an interval/duration that is a single integer (as in the number
of nanoseconds), you can leave month and day set to 0. The annoying
limitation is that a 64-bit nanos field can't represent 10K years. How else
would you turn that weird SQL interval literal into a single integer
duration? Would you assume a year is 365 days?

This is the internal representation of intervals in PostgreSQL. It follows
the same structure as MonthDayNano: a 64-bit offset and 32-bit days and
months. This has the nice feature of aligning nicely to 8 bytes without
padding and still not requiring 128-bit math for operations.

/*
 * Storage format for type interval.
 */
typedef struct {
    TimeOffset time; /* all time units other than days, months and years */

     int32 day; /* days, after time for alignment */
     int32 month; /* months and years, after time for alignment */
} Interval;

   - MonthDayNano allows mixed signs across components (e.g., positive
>    months and negative days), which complicates comparison and evaluation.
>

Yes, but adding more types to Arrow also complicates engines and
implementations.

Adding month and days to a timestamp wouldn't be so hard if timestamps also
stored the number of months and days since an epoch, but they are usually a
single value (millis, micros, nanos) since an epoch. Couldn't an argument
be made that any SQL system already needs to convert between single-integer
timestamps and YYYY-MM-DD HH:MM:SS literals, so this operation is not that
hard to support?

Complicated expressions involving timestamps and intervals could be
evaluated by converting timestamps to MonthDayNano durations since epoch
making every value become a duration, then at the end convert the duration
since epoch to a single-integer timestamp. Given how SQL literals are, I
don't see how operations on these values can be correct without these
concerns being taken into account.

> Given these differences, MonthDayNano is not a suitable candidate for
> representing DayTimeInterval and *we recommend not mapping DayTimeInterval
> to Arrow's MonthDayNano*.
>
> 3.* Memory Footprint: Is 16 bytes necessary for DayTimeInterval? *
>
>    - Some engines (e.g., Spark, Trino) represent DayTimeInterval using only
>    8 bytes, while others (like Oracle and Snowflake) support a wider range,
>    potentially requiring more than 8 bytes. Additionally, there is
> interest in
>    future support for higher precision, such as picoseconds, which would
> also
>    demand a larger footprint.
>    - One proposal is to parameterize the size or precision, allowing
>    engines to define their own representations. However, this approach
>    introduces complexity and makes standardization difficult. A fixed-size
>    format that provides enough range for most use cases is considered more
>    robust.
>    - Several alternative strategies have been proposed:
>    1. Use a 10-byte array, which is likely sufficient for all current
>    engine requirements.
>    2. Use a 16-byte array now, with the option to evolve it into a
>    standardized int128 in the future.
>    3. Start with an int64 representation, and plan for a future transition
>    to int128, updating related types such as timestamps and intervals in
>    parallel.
>
> Looking forward to hearing your thoughts on the above questions!
>
> Link to the proposal:
>
> https://docs.google.com/document/d/12ghQxWxyAhSQeZyy0IWiwJ02gTqFOgfYm8x851HZFLk/edit?tab=t.0
>
> Link to the PR: https://github.com/apache/parquet-format/pull/496/files
>
> Best Regards,
> Yun
>

So, can you expand on why MonthDayNano wouldn't work well? Yes, it's
complicated, but operating on Timestamps and Durations in SQL engines is a
complicated problem and after thinking about these types in the context of
a SQL engine I think the complexity of MonthDayNano is essential, not
accidental.

--
Felipe

Re: [Discussion] Follow up discussion for Interval Type

Reply via email to