Hi,

I would like to draw some attention to a format PR aiming to clarify
leap seconds, leap days and daylight saving handling semantics for
duration types: https://github.com/apache/arrow/pull/11138.

This came out of an effort [1] to implement partial and total ordering
for the DAY_TIME and MONTH_DAY_NANO duration types.

In short, I am proposing we clarify the following in the spec:

* For DAY_TIME durations, as with Time and Timestamp, we do not take
leap seconds into account, but we do take daylight saving into account.
As a result, days=1,ms=86400000 is not equal to days=2,ms=0.
* For MONTH_DAY_NANO, we do not take leap seconds into account, but we
do take leap days into account. Whether we take leap days into account
doesn't have a big impact here, because the number of days in a month
already varies even without leap days.
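
To make the DAY_TIME point concrete, here is a minimal sketch in Python.
The helper elapsed_ms and the two-day calendar are hypothetical and only
for illustration; they are not part of the spec or the PR. The sketch
assumes the days component follows the wall clock while the ms component
is plain elapsed time.

```python
HOUR_MS = 3_600_000

def elapsed_ms(days, ms, day_lengths_ms):
    """Elapsed milliseconds covered by a DAY_TIME value (days, ms).

    day_lengths_ms[i] is the real length of the i-th calendar day after a
    hypothetical start instant; it must list at least `days` entries.
    The ms component is treated as plain elapsed time.
    """
    return sum(day_lengths_ms[:days]) + ms

# Hypothetical calendar: an ordinary 24-hour day followed by a 23-hour
# day (clocks move forward for daylight saving on the second day).
calendar = [24 * HOUR_MS, 23 * HOUR_MS]

a = elapsed_ms(1, 86_400_000, calendar)  # days=1, ms=86400000
b = elapsed_ms(2, 0, calendar)           # days=2, ms=0
print(a - b)  # one hour, in milliseconds
```

With a calendar made only of 24-hour days the two values would span the
same elapsed time, which is why equality cannot be decided from the raw
fields alone.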

A consequence of this is that we will not be able to define a total
order for either DAY_TIME or MONTH_DAY_NANO durations. Similar to
floating point values, we will only be able to define a partial order
for these two types. This impacts downstream sorting compute kernels
because we can't simply sort these values lexicographically by their
raw int tuples.
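
As a rough illustration of what a partial order could look like for
DAY_TIME, here is a sketch in Python. It is not the definition from the
PR: the function name partial_cmp is hypothetical, and it assumes a
calendar day lasts between 23 and 25 hours, which does not cover every
real-world daylight saving rule.

```python
HOUR_MS = 3_600_000
MIN_DAY_MS = 23 * HOUR_MS  # assumed shortest calendar day
MAX_DAY_MS = 25 * HOUR_MS  # assumed longest calendar day

def partial_cmp(a, b):
    """Compare two DAY_TIME values given as (days, ms) tuples.

    Returns -1, 0 or 1 when the order holds for every possible day
    length in the assumed range, and None when the values are
    incomparable.
    """
    dd = a[0] - b[0]  # difference in the days field
    dm = a[1] - b[1]  # difference in the ms field
    if dd == 0 and dm == 0:
        return 0
    # Range of elapsed time that the day difference may represent.
    if dd >= 0:
        lo, hi = dd * MIN_DAY_MS, dd * MAX_DAY_MS
    else:
        lo, hi = dd * MAX_DAY_MS, dd * MIN_DAY_MS
    lo += dm
    hi += dm
    if lo > 0:
        return 1
    if hi < 0:
        return -1
    return None  # sign depends on actual day lengths

print(partial_cmp((2, 0), (1, 0)))
print(partial_cmp((1, 86_400_000), (2, 0)))
```

Under these assumptions days=2 is always greater than days=1, while
days=1,ms=86400000 and days=2,ms=0 are incomparable, so a sort kernel
would need an explicit policy for such pairs.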

Another consequence of this is that normalization cannot be applied to
either type, i.e. we can't normalize days=1,ms=86400000 into days=2 or
months=1,days=30 into months=2. This could simplify downstream hash
aggregate/join compute kernels because we can just hash the raw int
tuples to generate the hash keys.
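
A minimal sketch of hashing on the raw fields, in Python. The helper
names day_time_key and group_by_raw are hypothetical, and the key layout
assumes DAY_TIME is stored as two little-endian 32-bit signed integers
(days, then milliseconds).

```python
import struct
from collections import defaultdict

def day_time_key(days, ms):
    # Pack the two 32-bit signed fields as-is, with no normalization.
    return struct.pack("<ii", days, ms)

def group_by_raw(values):
    """Group row indices by the raw bytes of each (days, ms) value."""
    groups = defaultdict(list)
    for index, (days, ms) in enumerate(values):
        groups[day_time_key(days, ms)].append(index)
    return dict(groups)

values = [(1, 86_400_000), (2, 0), (2, 0)]
groups = group_by_raw(values)
print(len(groups))  # number of distinct keys
```

Because days=1,ms=86400000 is never folded into days=2,ms=0, the first
row lands in its own group and only the two identical tuples share a key.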

[1]: https://github.com/jorgecarleitao/arrow2/pull/398

Thanks,
QP
