Hi, I recently took the time to understand MonthDayNano better and it's not as crazy as my first impression of it.
I think we could solve many of the problems you're trying to solve by embracing it. Replies inline: On Tue, Jul 1, 2025 at 2:56 AM yun zou <yunzou.colost...@gmail.com> wrote: > Hi Team, > > Resending the previous email about the Interval Type discussion from the > last Parquet community sync, with some formatting adjustments. > > The primary focus of the conversation is the proposed INTERVAL type's > *compatibility > with Apache Arrow*. Several key issues have been raised: > > 1. *Is there a more descriptive name for DayTimeInterval?* > While the name DayTimeInterval closely follows the SQL standard and matches > naming conventions used by most engines, some suggest that a name > emphasizing precision—such as *DayNanoInterval*—might provide better > clarity. > > 2. *Should we consider representing DayTimeInterval using Arrow's > MonthDayNano?* > Mapping DayTimeInterval to Arrow's MonthDayNano type is problematic due to > semantic differences: > > - MonthDayNano combines both calendar-based and duration-based > components, whereas DayTimeInterval represents a pure duration. > Month and Day components actually allow the SQL engine to delay caring about calendars when parsing interval literals: INTERVAL '2-11 28' YEAR TO DAY month = 2 * 12 + 11. (there are no leap months, so converting from years to month is easy) day = 28 nanos = 0 If you get an interval/duration that is a single integer (as in the number of nanoseconds), you can leave month and day set to 0. The annoying limitation is that a 64-bit nanos field can't represent 10K years. How else would you turn that weird SQL interval literal into a single integer duration? Would you assume a year is 365 days? This is the internal representation of intervals in PostgreSQL. It follows the same structure as MonthDayNano: a 64-bit offset and 32-bit days and months. This has the nice feature of aligning nicely to 8 bytes without padding and still not requiring 128-bit math for operations. /* * Storage format for type interval. */ typedef struct { TimeOffset time; /* all time units other than days, months and years */ int32 day; /* days, after time for alignment */ int32 month; /* months and years, after time for alignment */ } Interval; - MonthDayNano allows mixed signs across components (e.g., positive > months and negative days), which complicates comparison and evaluation. > Yes, but adding more types to Arrow also complicates engines and implementations. Adding month and days to a timestamp wouldn't be so hard if timestamps also stored the number of months and days since an epoch, but they are usually a single value (millis, micros, nanos) since an epoch. Couldn't an argument be made that any SQL system already needs to convert between single-integer timestamps and YYYY-MM-DD HH:MM:SS literals, so this operation is not that hard to support? Complicated expressions involving timestamps and intervals could be evaluated by converting timestamps to MonthDayNano durations since epoch making every value become a duration, then at the end convert the duration since epoch to a single-integer timestamp. Given how SQL literals are, I don't see how operations on these values can be correct without these concerns being taken into account. > Given these differences, MonthDayNano is not a suitable candidate for > representing DayTimeInterval and *we recommend not mapping DayTimeInterval > to Arrow's MonthDayNano*. > > 3.* Memory Footprint: Is 16 bytes necessary for DayTimeInterval? * > > - Some engines (e.g., Spark, Trino) represent DayTimeInterval using only > 8 bytes, while others (like Oracle and Snowflake) support a wider range, > potentially requiring more than 8 bytes. Additionally, there is > interest in > future support for higher precision, such as picoseconds, which would > also > demand a larger footprint. > - One proposal is to parameterize the size or precision, allowing > engines to define their own representations. However, this approach > introduces complexity and makes standardization difficult. A fixed-size > format that provides enough range for most use cases is considered more > robust. > - Several alternative strategies have been proposed: > 1. Use a 10-byte array, which is likely sufficient for all current > engine requirements. > 2. Use a 16-byte array now, with the option to evolve it into a > standardized int128 in the future. > 3. Start with an int64 representation, and plan for a future transition > to int128, updating related types such as timestamps and intervals in > parallel. > > Looking forward to hearing your thoughts on the above questions! > > Link to the proposal: > > https://docs.google.com/document/d/12ghQxWxyAhSQeZyy0IWiwJ02gTqFOgfYm8x851HZFLk/edit?tab=t.0 > > Link to the PR: https://github.com/apache/parquet-format/pull/496/files > > Best Regards, > Yun > So, can you expand on why MonthDayNano wouldn't work well? Yes, it's complicated, but operating on Timestamps and Durations in SQL engines is a complicated problem and after thinking about these types in the context of a SQL engine I think the complexity of MonthDayNano is essential, not accidental. -- Felipe