For those who were following this topic, a JIRA has since been created for interval ordering in the C++ compute engine, so feel free to weigh in there: https://issues.apache.org/jira/browse/ARROW-14122
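
For anyone who wants to experiment before commenting on the JIRA, here is a minimal Python sketch of the fixed-length ordering Weston describes in the quoted thread below. It is not Arrow's or Postgres's implementation: `MonthDayNano`, `flat_key`, and the constants are hypothetical names, and the sketch assumes 30-day months and 24-hour days. MONTH_DAY_NANO has no year field, so a year is stored as 12 months here and compares equal to 360 days, which is consistent with where "1 year" lands in the ordering Weston quoted.

```python
from typing import NamedTuple

NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000  # assumes every day is 24 hours
DAYS_PER_MONTH = 30                           # assumes every month is 30 days


class MonthDayNano(NamedTuple):
    """Hypothetical stand-in for a MONTH_DAY_NANO interval value."""
    months: int
    days: int
    nanos: int


def flat_key(iv: MonthDayNano) -> int:
    """Collapse the three fields into one nanosecond count for sorting."""
    return (iv.months * DAYS_PER_MONTH + iv.days) * NANOS_PER_DAY + iv.nanos


intervals = {
    "366 days": MonthDayNano(0, 366, 0),
    "365 days": MonthDayNano(0, 365, 0),
    "1 year": MonthDayNano(12, 0, 0),   # a year expressed as 12 months
    "360 days": MonthDayNano(0, 360, 0),
    "12 months": MonthDayNano(12, 0, 0),
    "359 days": MonthDayNano(0, 359, 0),
}

# Sort the labels by the flattened key; equal keys keep their input order.
ordered = sorted(intervals, key=lambda name: flat_key(intervals[name]))
print(ordered)
```

The three 360-day values share one key, so their relative order depends only on how the sort breaks ties. A system that treats intervals as incomparable would simply not define such a key.
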
On Mon, Sep 20, 2021 at 2:45 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> It looks good to me.
>
> Rok
>
> On Mon, Sep 20, 2021 at 2:36 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > All, can you please take a look at QP's PR at
> > https://github.com/apache/arrow/pull/11138 ?
> >
> > I don't believe this requires a vote as this clarification is consistent
> > with the already clarified semantics for Time and Timestamp types. The
> > current PR contents are ready for a merge, and I think they can be
> > merged soon if nobody opposes.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 17/09/2021 at 05:33, QP Hou wrote:
> > > Thank you for your feedback Weston and Antoine. I agree that ordering
> > > discussion should be out of scope for the Arrow format spec. I have
> > > removed the reference to ordering in the PR so now the only change is
> > > mentioning leap seconds to keep it consistent with other temporal
> > > types.
> > >
> > > I would like to add that even though we are not explicitly discussing
> > > ordering in the spec, any kind of restriction we assign to a type
> > > would still implicitly impact ordering in downstream compute kernels.
> > > This is why I also took out the discussion of leap days in my PR.
> > >
> > > Thanks,
> > > QP
> > >
> > > On Tue, Sep 14, 2021 at 12:46 AM Antoine Pitrou <anto...@python.org> wrote:
> > >>
> > >>
> > >> I agree with Weston that ordering isn't in the scope for the Arrow
> > >> format spec (*). For example, implementations are free to define UTF8
> > >> comparisons and ordering as they wish (some may want to invest in the
> > >> complexity of the official Unicode collation algorithm, others may be
> > >> content with a simple codepoint-wise lexicographic comparison). It
> > >> doesn't prevent them from exchanging UTF8 data unambiguously using
> > >> Arrow.
> > >>
> > >> (*) It may be in the scope for a hypothetical Compute IR spec, however.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >> On 14/09/2021 at 07:16, QP Hou wrote:
> > >>> Good point Weston. My proposal was written with the impression that
> > >>> Arrow does want to define semantics for some of these temporal types
> > >>> based on the existing comments in the Schema.fbs file.
> > >>>
> > >>> For example, here is a quote taken from the comments for the Time type:
> > >>>
> > >>> /// This definition doesn't allow for leap seconds. Time values from
> > >>> /// measurements with leap seconds will need to be corrected when ingesting
> > >>> /// into Arrow (for example by replacing the value 86400 with 86399).
> > >>>
> > >>> Here is another quote for the Date type:
> > >>>
> > >>> /// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no
> > >>> /// leap seconds), where the values are evenly divisible by 86400000
> > >>>
> > >>> For the interval type, we have:
> > >>>
> > >>> // A "calendar" interval which models types that don't necessarily
> > >>> // have a precise duration without the context of a base timestamp (e.g.
> > >>> // days can differ in length during day light savings time transitions).
> > >>>
> > >>> I think pushing the responsibility to define these semantics to the
> > >>> data producer side is also a perfectly fine design with its own
> > >>> trade-offs. It would make data exchange between two different systems
> > >>> a little bit harder because consumers need to be aware of the
> > >>> semantics defined by the producer. On the other hand, it does make the
> > >>> producer implementation easier. It also makes data exchange within the
> > >>> same system more efficient if that system's temporal type semantics are
> > >>> different from what's defined in Arrow's spec.
> > >>>
> > >>> Either way, I think it would be good if we can be consistent on our
> > >>> temporal type semantics in the spec. If we are making the claim that
> > >>> leap seconds should not be taken into account for Time, Timestamp and
> > >>> Date types, then it seems natural to make this claim for Interval type
> > >>> as well. Alternatively, we could update the spec to make all temporal
> > >>> types leap-second agnostic.
> > >>>
> > >>> On Mon, Sep 13, 2021 at 12:03 PM Weston Pace <weston.p...@gmail.com> wrote:
> > >>>>
> > >>>> One could define a sorting based on 30-day months, 365-day years, and
> > >>>> 24-hour days. It would be consistent but can lead to some surprising
> > >>>> results. It appears that this is what Postgres does as I got the
> > >>>> following ordering for an interval:
> > >>>>
> > >>>> 359 days, 12 months, 360 days, 1 year, 365 days, 366 days
> > >>>>
> > >>>> On the other hand, Joda time forbids comparison of periods (their
> > >>>> version of what we call an interval) and offers three ways to convert
> > >>>> to a duration. There are toDurationFrom(instant) and
> > >>>> toDurationTo(instant), which give durations from specific calendar
> > >>>> ranges, and then there is toStandardDuration(), which converts to a
> > >>>> duration based on 24-hour days. However, toStandardDuration will
> > >>>> still fail if the period has >0 months or years (presumably because
> > >>>> months and years are too inconsistent).
> > >>>>
> > >>>> I'm not sure though that this is something that Arrow needs to define.
> > >>>> We aren't specifying any invalid ranges of values. I don't foresee
> > >>>> any interoperability concerns. A system that treated intervals as
> > >>>> comparable (and didn't factor in DST, leap years, etc.) will read and
> > >>>> write intervals the same way as a system that considers intervals
> > >>>> incomparable.
> > >>>>
> > >>>> This question seems to fall into the "compute" space inhabited by
> > >>>> topics like "is 'false && null' a false value or a null value" and
> > >>>> "should addition overflow or throw an exception".
> > >>>>
> > >>>> On Mon, Sep 13, 2021 at 6:23 AM QP Hou <houqp....@gmail.com> wrote:
> > >>>>>
> > >>>>> On Mon, Sep 13, 2021 at 6:18 AM Antoine Pitrou <anto...@python.org> wrote:
> > >>>>>> The Duration type is defined with a TimeUnit. You are probably
> > >>>>>> thinking about the Interval type.
> > >>>>>>
> > >>>>>
> > >>>>> Oops, my bad, yes, it should be Interval type not Duration.
> > >>>>>
> > >>>>>> Ok. How about daylight savings? I suppose they are taken into
> > >>>>>> account as well.
> > >>>>>>
> > >>>>>
> > >>>>> Yes, the day components in both DAY_TIME and MONTH_DAY_NANO take
> > >>>>> daylight savings into account.
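
To make the daylight savings point at the bottom of the thread concrete, here is a small Python sketch showing that a "1 day" calendar interval does not always span 24 hours. The names `EST`, `EDT`, and `elapsed_hours` are made up for illustration, and the UTC offsets are hardcoded for the US Eastern transition on 2021-03-14 rather than looked up from a time zone database, so treat it as an illustration and not as how any Arrow implementation handles intervals.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets assumed for US Eastern time around the 2021-03-14 transition.
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def elapsed_hours(start: datetime, end: datetime) -> float:
    """Physical time between two instants, compared in UTC."""
    delta = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return delta.total_seconds() / 3600


# Noon to noon on the wall clock, with no transition in between.
ordinary_day = elapsed_hours(
    datetime(2021, 3, 12, 12, 0, tzinfo=EST),
    datetime(2021, 3, 13, 12, 0, tzinfo=EST),
)

# Noon to noon on the wall clock, across the spring-forward transition.
dst_day = elapsed_hours(
    datetime(2021, 3, 13, 12, 0, tzinfo=EST),
    datetime(2021, 3, 14, 12, 0, tzinfo=EDT),
)

print(ordinary_day, dst_day)  # 24.0 23.0
```

Both spans are "one day" on the calendar, which is why a day component cannot be reduced to a fixed duration without a base timestamp.
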