Hi Team,

Resending the previous email about the Interval Type discussion from the
last Parquet community sync, with some formatting adjustments.

The primary focus of the conversation was the proposed INTERVAL type's
*compatibility with Apache Arrow*. Several key issues were raised:

1. *Is there a more descriptive name for DayTimeInterval?*
While the name DayTimeInterval closely follows the SQL standard and matches
naming conventions used by most engines, some suggest that a name
emphasizing precision—such as *DayNanoInterval*—might provide better
clarity.

2. *Should we consider representing DayTimeInterval using Arrow's
MonthDayNano?*
Mapping DayTimeInterval to Arrow's MonthDayNano type is problematic due to
semantic differences:

   - MonthDayNano combines both calendar-based and duration-based
   components, whereas DayTimeInterval represents a pure duration.
   - MonthDayNano allows mixed signs across components (e.g., positive
   months and negative days), which complicates comparison and evaluation.

Given these differences, MonthDayNano is not a suitable candidate for
representing DayTimeInterval and *we recommend not mapping DayTimeInterval
to Arrow's MonthDayNano*.
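
To illustrate the mixed-sign comparison problem, here is a minimal Python
sketch. It is not Arrow's or Parquet's API: the helpers add_months and
apply_interval are hypothetical, the nanosecond component is left out for
brevity, and clamping to the end of the month is an assumed rule that
engines may implement differently. It shows that the ordering of two
month/day intervals can flip depending on the anchor date they are applied
to.

```python
from datetime import date, timedelta
import calendar


def add_months(d, months):
    # Hypothetical helper: shift by whole months, clamping the day to the
    # end of the target month (an assumed rule, engines may differ).
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def apply_interval(anchor, months, days):
    # Apply the calendar part first, then the day part.
    return add_months(anchor, months) + timedelta(days=days)


interval_a = (1, -1)   # +1 month, -1 day (mixed signs)
interval_b = (0, 29)   # +29 days

for anchor in (date(2024, 2, 1), date(2024, 3, 1)):
    end_a = apply_interval(anchor, *interval_a)
    end_b = apply_interval(anchor, *interval_b)
    # From 2024-02-01: A ends 2024-02-29, B ends 2024-03-01, so A < B.
    # From 2024-03-01: A ends 2024-03-31, B ends 2024-03-30, so A > B.
    print(anchor, end_a, end_b, "A < B" if end_a < end_b else "A > B")
```

A pure duration, by contrast, compares the same way regardless of any
anchor date.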

3. *Memory Footprint: Is 16 bytes necessary for DayTimeInterval?*

   - Some engines (e.g., Spark, Trino) represent DayTimeInterval using only
   8 bytes, while others (like Oracle and Snowflake) support a wider range,
   potentially requiring more than 8 bytes. Additionally, there is interest in
   future support for higher precision, such as picoseconds, which would also
   demand a larger footprint.
   - One proposal is to parameterize the size or precision, allowing
   engines to define their own representations. However, this approach
   introduces complexity and makes standardization difficult. A fixed-size
   format that provides enough range for most use cases is considered more
   robust.
   - Several alternative strategies have been proposed:
      1. Use a 10-byte array, which is likely sufficient for all current
      engine requirements.
      2. Use a 16-byte array now, with the option to evolve it into a
      standardized int128 in the future.
      3. Start with an int64 representation, and plan for a future transition
      to int128, updating related types such as timestamps and intervals in
      parallel.
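
As a rough aid for comparing the three widths, the Python sketch below
computes how many whole days fit in each one. It assumes the interval is
stored as a single signed two's-complement count of sub-second units, which
is a simplification of my own and not necessarily the layout in the
proposal. The function name max_days is hypothetical.

```python
SECONDS_PER_DAY = 86_400
NANOS_PER_SECOND = 10**9
PICOS_PER_SECOND = 10**12


def max_days(width_bytes, units_per_second):
    # Largest positive value of a signed integer of the given width,
    # expressed in whole days at the given sub-second resolution.
    max_value = 2 ** (8 * width_bytes - 1) - 1
    return max_value // (units_per_second * SECONDS_PER_DAY)


for width in (8, 10, 16):
    print(
        f"{width:>2} bytes: "
        f"{max_days(width, NANOS_PER_SECOND)} days at nanoseconds, "
        f"{max_days(width, PICOS_PER_SECOND)} days at picoseconds"
    )
```

Under that assumption, 8 bytes covers 106751 days (about 292 years) at
nanosecond precision but only 106 days at picosecond precision, which is
why picosecond support would push the format beyond 8 bytes.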

Looking forward to hearing your thoughts on the above questions!

Link to the proposal:
https://docs.google.com/document/d/12ghQxWxyAhSQeZyy0IWiwJ02gTqFOgfYm8x851HZFLk/edit?tab=t.0

Link to the PR: https://github.com/apache/parquet-format/pull/496/files

Best Regards,
Yun
