Thank you, that looks like a good summary to me.

I guess a third option would be for the new INTERVAL_DAY_TIME to annotate a composite type of days and milliseconds, as originally proposed back in 2016 [1]. This would have the benefit of preserving the information necessary for engines to treat days as calendar days if they choose to do so, whilst also mapping to an existing Arrow data type. For the purposes of a sort order, days could be defined as 24 hours, with engines free to ignore this if they wish to treat intervals differently. The major downside is the lack of nanosecond precision.
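To make the sort-order idea concrete, here is a small sketch (the function name and the 24-hour assumption are mine, not from any spec): the composite value keeps its raw (days, milliseconds) fields for calendar-aware engines, while a derived key gives a total ordering for statistics.

```python
# Hypothetical sketch of a composite (days, milliseconds) DAY_TIME value.
# The sort key treats a day as exactly 24 hours (86,400,000 ms); engines
# that treat days as calendar days can ignore the key and use the raw
# fields. Names here are illustrative only.

MS_PER_DAY = 86_400_000

def day_time_sort_key(days: int, millis: int) -> int:
    """Total ordering under the assumption that a day is exactly 24 hours."""
    return days * MS_PER_DAY + millis

# Two encodings of the same 25-hour span compare equal under the key...
assert day_time_sort_key(1, 3_600_000) == day_time_sort_key(0, 90_000_000)
# ...yet stay distinguishable for engines that care about calendar days.
assert (1, 3_600_000) != (0, 90_000_000)
```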

A fourth possibility I can see, which could also be pursued in parallel, would be to define varying-precision duration types similar to timestamps. This would be functionally identical to the proposed DAY-TIME integer, but without the rather confusing characteristic of not actually storing a number of days, or really being an interval in the universally understood sense. This would also provide a standardized way to map the Arrow duration types, which currently rely on Arrow-specific metadata for their encoding.
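As a rough illustration of what "varying precision, like timestamps" could mean (the enum and helper below are invented for this sketch, not part of any proposal text), the stored integer would be interpreted via a unit parameter on the logical type, and normalized for comparison:

```python
# Sketch of a duration logical type parameterized by time unit, mirroring
# how Parquet timestamps carry a MILLIS/MICROS/NANOS unit. All names are
# illustrative assumptions, not spec definitions.
from enum import Enum

class TimeUnit(Enum):
    # enum value = nanoseconds per stored unit
    MILLIS = 1_000_000
    MICROS = 1_000
    NANOS = 1

def to_nanos(value: int, unit: TimeUnit) -> int:
    """Normalize a stored duration to nanoseconds for comparison."""
    return value * unit.value

# The same physical span stored at different precisions compares equal.
assert to_nanos(1_500, TimeUnit.MILLIS) == to_nanos(1_500_000, TimeUnit.MICROS)
```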

I'm sure people smarter than I can devise further alternative proposals; however, I think that any proposal must preserve existing functionality if it is to be widely adopted. In particular, any proposal without a good story for encoding calendar days and mapping to Arrow is likely to fail to replace the existing INTERVAL type, for which these are well defined.

Kind Regards,

Raphael

[1]: https://github.com/apache/parquet-format/pull/43

On 07/07/2025 07:37, Micah Kornfield wrote:
I think I've lost the thread a little bit on this discussion.  I'd like to
try to summarize my understanding of the issues at hand and current
proposals.

As a short summary, my understanding of the requirements we want for a new
parquet day-time interval type are:

1.  Represent +/- 10K years at at least nanosecond granularity (IIUC the
range is specified by ANSI SQL; the granularity is picked as a convenience
that covers most use cases).
2.  Allow pruning via min-max statistics
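The range requirement has a concrete storage consequence worth spelling out: a single signed 64-bit nanosecond counter is not enough. A back-of-envelope check (ignoring leap years):

```python
# Requirement 1 rules out a plain i64 nanosecond counter: signed 64 bits
# covers only about +/- 292 years at nanosecond granularity, so +/- 10K
# years needs a wider representation (e.g. 96 or 128 bits).
NS_PER_YEAR = 365 * 24 * 3600 * 10**9  # ignoring leap years

i64_max = 2**63 - 1
assert i64_max // NS_PER_YEAR == 292       # i64 tops out near +/- 292 years
assert 10_000 * NS_PER_YEAR > i64_max      # 10K years overflows i64
assert 10_000 * NS_PER_YEAR < 2**95        # but fits comfortably in 96 bits
```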

Shortcomings of the existing interval type:
1.  Millisecond granularity.
2.  Only supports positive values (all three integers in the 12-byte FLBA
are unsigned).
3.  No sort order
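For reference, the existing INTERVAL layout that these shortcomings refer to can be sketched as follows (the Parquet spec defines it as a 12-byte fixed_len_byte_array of three unsigned little-endian 32-bit integers; the helper name is mine):

```python
import struct

# The existing INTERVAL annotates a fixed_len_byte_array(12) holding three
# unsigned little-endian 32-bit integers: months, days, milliseconds.
# The unsigned fields are what rule out negative intervals (shortcoming 2),
# and milliseconds are the finest unit available (shortcoming 1).
def decode_interval(raw: bytes) -> tuple[int, int, int]:
    """Decode a 12-byte INTERVAL into (months, days, millis)."""
    assert len(raw) == 12
    return struct.unpack("<III", raw)

raw = struct.pack("<III", 1, 2, 3_600_000)  # 1 month, 2 days, 1 hour
assert decode_interval(raw) == (1, 2, 3_600_000)
```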

Proposals:
1.  Fixed-width integer duration type (assumes days are always exactly
86400 seconds)
     Pros:
        - Easy to model.
        - Maps easily to ANSI SQL Day Time Interval representations that
several engines use.
     Cons:
        - Doesn't match well with any Arrow type.
        - Doesn't capture semantics for engines that treat days as calendar
days.
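Proposal 1 can be sketched in a few lines (the helper name and field breakdown are illustrative; the point is that a SQL DAY TO SECOND interval collapses to one signed integer under the 86,400-second-day assumption):

```python
# Sketch of proposal 1: collapse a SQL DAY TO SECOND interval into a single
# signed integer of nanoseconds, assuming every day is exactly 86,400
# seconds. The helper and its signature are illustrative assumptions.
def day_time_to_nanos(days: int, hours: int, minutes: int,
                      seconds: int, nanos: int) -> int:
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds * 10**9 + nanos

# INTERVAL '1 12:00:00' DAY TO SECOND -> 1.5 days of nanoseconds
assert day_time_to_nanos(1, 12, 0, 0, 0) == 36 * 3600 * 10**9
# Negative intervals are representable, unlike the existing INTERVAL type.
assert day_time_to_nanos(-1, 0, 0, 0, 0) == -24 * 3600 * 10**9
```

Note that the loss of calendar-day semantics listed under Cons is visible here: once collapsed, there is no way to recover whether a value originally meant "1 day" or "24 hours".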

2.  IIUC, something based off of Arrow's MonthDayNanos type.  I apologize,
I'm not sure if this is what is being proposed, but here are the approaches
I see with this:
    - FLBA 16-byte type exactly like Arrow
    - Shredded (month, day and nanos each have a separate field under a
struct).  The main complication with this would be statistics, since they
are all on the leaf fields, so there is either some information loss or a
convention for recording statistics on one of the leaves needs to be
followed.

    - Pros:
       - Potentially more efficient storage
       - Maps directly to one of Arrow's Interval types.
    - Cons:
       - Comparison in general is challenging. Perhaps this can be solved by
a new sort order that clarifies relationships between fields, or additional
metadata on the logical type, e.g. UNORDERED (Arrow semantics), YEAR-MONTH
(only the month field is expected to be populated and values are ordered on
it), or DAY-TIME ordered (a day means 86400 seconds and nanoseconds are
normalized to be less than 24 hours).
       - Doesn't conform to the notion of ANSI SQL interval types (which
have two separate representations for Year/Month and Day/Time).  So some
Parquet files might not be readable by ANSI SQL engines.
       - For the likely uncommon case (intervals ~+/- 200 years) might
require more transformation to an engine's internal representation.
       - Only works for the SQL standard if days are assumed to be 86400
seconds (right now fields are specified as independent of each other).
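For the FLBA 16-byte variant, the layout could simply mirror Arrow's MonthDayNano struct: a signed 32-bit month count, a signed 32-bit day count, and a signed 64-bit nanosecond count. A sketch (the little-endian byte order for a Parquet FLBA mapping is my assumption):

```python
import struct

# Arrow's MonthDayNano interval value is 16 bytes: signed 32-bit months,
# signed 32-bit days, signed 64-bit nanoseconds. A Parquet FLBA(16) mapping
# could reuse this layout; the byte order shown here is an assumption.
def pack_month_day_nano(months: int, days: int, nanos: int) -> bytes:
    return struct.pack("<iiq", months, days, nanos)

def unpack_month_day_nano(raw: bytes) -> tuple[int, int, int]:
    return struct.unpack("<iiq", raw)

raw = pack_month_day_nano(0, 1, -3_600_000_000_000)  # 1 day minus 1 hour
assert len(raw) == 16
assert unpack_month_day_nano(raw) == (0, 1, -3_600_000_000_000)
```

The signed fields address both the negative-value and the range shortcomings of the existing 12-byte INTERVAL, at the cost of the comparison difficulties listed above.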


Did I get the requirements right?  Are there some other options people were
thinking about?

Thanks,
Micah



On Fri, Jul 4, 2025 at 6:49 AM Antoine Pitrou <anto...@python.org> wrote:

On Thu, 3 Jul 2025 17:22:27 +0100
Raphael Taylor-Davies
<r.taylordav...@googlemail.com.INVALID>
wrote:
Hi,

I may be misunderstanding something, but it appears that the motivation
for this effort is that the current interval type represents a superset
of the functionality required by ANSI SQL, and therefore can cause
compatibility problems for some databases that only support the minimum
required by ANSI SQL?
While I understand the desire to be able to represent all values
allowable in ANSI SQL, I really don't understand why our types should
not be allowed to represent any values *outside* of the range allowed
in ANSI SQL.

Please let's be mindful that Parquet is not useful only for SQL-type
workloads. Besides, ANSI SQL itself might evolve, and we don't want to
add another Interval type in a few years because the one we're
currently specifying ends up too tight.

(If some people want to make sure that values don't fall outside the
ANSI SQL range, they can write a validation pass for it; no need to
burden the Parquet *format* with such constraints.)

Regards

Antoine.


