Hi Team,

Resending the previous email about the Interval Type discussion from the last Parquet community sync, with some formatting adjustments.

The primary focus of the conversation is the proposed INTERVAL type's *compatibility with Apache Arrow*. Several key issues have been raised:

1. *Is there a more descriptive name for DayTimeInterval?*

   While the name DayTimeInterval closely follows the SQL standard and matches naming conventions used by most engines, some suggest that a name emphasizing precision, such as *DayNanoInterval*, might provide better clarity.

2. *Should we consider representing DayTimeInterval using Arrow's MonthDayNano?*

   Mapping DayTimeInterval to Arrow's MonthDayNano type is problematic due to semantic differences:

   - MonthDayNano combines both calendar-based and duration-based components, whereas DayTimeInterval represents a pure duration.
   - MonthDayNano allows mixed signs across components (e.g., positive months and negative days), which complicates comparison and evaluation.

   Given these differences, MonthDayNano is not a suitable candidate for representing DayTimeInterval and *we recommend not mapping DayTimeInterval to Arrow's MonthDayNano*.

3. *Memory Footprint: Is 16 bytes necessary for DayTimeInterval?*

   - Some engines (e.g., Spark, Trino) represent DayTimeInterval using only 8 bytes, while others (like Oracle and Snowflake) support a wider range, potentially requiring more than 8 bytes. Additionally, there is interest in future support for higher precision, such as picoseconds, which would also demand a larger footprint.
   - One proposal is to parameterize the size or precision, allowing engines to define their own representations. However, this approach introduces complexity and makes standardization difficult. A fixed-size format that provides enough range for most use cases is considered more robust.
   - Several alternative strategies have been proposed:
     1. Use a 10-byte array, which is likely sufficient for all current engine requirements.
     2. Use a 16-byte array now, with the option to evolve it into a standardized int128 in the future.
     3. Start with an int64 representation, and plan for a future transition to int128, updating related types such as timestamps and intervals in parallel.

Looking forward to hearing your thoughts on the above questions!

Link to the proposal: https://docs.google.com/document/d/12ghQxWxyAhSQeZyy0IWiwJ02gTqFOgfYm8x851HZFLk/edit?tab=t.0
Link to the PR: https://github.com/apache/parquet-format/pull/496/files

Best Regards,
Yun
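
P.S. For the memory footprint question (item 3), here is a rough back-of-the-envelope sketch of the range each width would give. It assumes the interval is stored as a single signed integer counting sub-second units, which may not match the final layout in the proposal. The helper name `max_span_years` and the 365.2425-day average year are my own illustrative choices, not part of the proposal or the PR.

```python
# Rough range estimate for a signed integer of a given byte width that counts
# sub-second units (nanoseconds by default). Assumption: one signed integer
# holds the whole interval; the real encoding in the proposal may differ.

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.2425  # average Gregorian year, only for a rough estimate


def max_span_years(num_bytes: int, units_per_second: int = 10**9) -> float:
    """Approximate largest positive interval, in years, for the given width."""
    max_value = 2 ** (8 * num_bytes - 1) - 1  # largest signed value
    seconds = max_value / units_per_second
    return seconds / SECONDS_PER_DAY / DAYS_PER_YEAR


if __name__ == "__main__":
    for size in (8, 10, 16):
        nanos = max_span_years(size)
        picos = max_span_years(size, units_per_second=10**12)
        print(f"{size:>2} bytes: ~{nanos:.3g} years at ns, ~{picos:.3g} years at ps")
```

Under these assumptions, 8 bytes of nanoseconds covers roughly 292 years in each direction, and the same 8 bytes at picosecond precision covers well under one year, which is the concern behind the wider options.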