Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

Jacques Nadeau Wed, 08 Nov 2017 15:06:07 -0800

My analysis previously (if I recall) was basically (I think I did it on the
similar Parquet PR) was that no system truly supported N fields for all
operations (postgres was closest I believe). Some basic operations would
maintain them but they would quickly not behave differently (e.g. 24 hours
is the same as one day in basically all cases).


On Wed, Nov 8, 2017 at 2:54 PM, Julian Hyde <jh...@apache.org> wrote:

> I have argued before on this list, and still believe, that you should
> represent an interval as you would a number. If intervals are 64 bit
> signed, then sure, use the 64 bit integer representation; if you were
> to allow intervals with fixed precision and scale, then use the same
> representation as for decimal(p, s).
>
> My understanding of timedelta is that you can't use a 2-field struct.
> You need timedelta (3 hour 20 minutes 7 seconds) to be distinguishable
> from timedelta (2 hour 79 minutes 67 seconds), correct? If so you need
> N fields.
>
> Julian
>
>
> On Wed, Nov 8, 2017 at 2:47 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > Makes sense. The key question is whether the data is represented as a
> > single 64-bit integer or as effectively a C struct
> >
> > struct {
> >   int32_t days;
> >   int32_t milliseconds;
> > }
> >
> > The struct representation cannot accommodate higher resolution units
> > like microseconds and nanoseconds. From my perspective, if we use the
> > 64-bit integer representation (which is for the millis case days *
> > 86400000 + milliseconds) then whether we call it Interval or Timedelta
> > is sort of immaterial to the immediate use cases I have, where users
> > have a timedelta64[UNIT] type (which results from any arithmetic
> > between timestamp values)
> >
> > On Wed, Nov 8, 2017 at 5:38 PM, Julian Hyde <jh...@apache.org> wrote:
> >> I don't know many examples of interval being used in the real world.
> >> But here's the kind of thing: the policy is that an offer is open for
> >> 60 hours, so if the offer is made to a particular customer at 12:34pm
> >> on Sunday, you want to compute that it ends at 12:34am on Wednesday.
> >> The interval "60 hours" is really just syntactic sugar for 216,000
> >> seconds. You could write it as interval '60' hour, or interval '2:12'
> >> day to hour, or interval '129600' second, but the values and
> >> underlying representation are the same. (Interval '1:36' day to hour
> >> is not a valid value, because 36 is out of the valid hour range 0..23,
> >> but you could construct the value using interval '1' day + 36 *
> >> interval '1' hour.)
> >>
> >> My understanding is that a timedelta (2 day 12 hours) is different
> >> from timedelta (60 hours) and timedelta (1 day 36 hours), but all are
> >> valid timedelta values.
> >>
> >> For my offer expiration example the SQL-style interval is sufficient,
> >> because there is no material difference between 2:12 and 1:36.
> >>
> >> But I am sure you can provide use cases where timedelta is necessary.
> >>
> >> I don't claim one is better than the other, and I'm not volunteering
> >> to implement either of them, so I don't have a say which you should do
> >> first. But please keep the names "interval" and "timedelta", so the
> >> various communities aren't confused about semantics.
> >>
> >>
> >> On Wed, Nov 8, 2017 at 2:15 PM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>> Pleading ignorance on use of the SQL interval type, my prior would be
> >>> that many algorithms would first convert the interval components into
> >>> an absolute timedelta. Is that not the case?
> >>>
> >>> My preference right now would be to have a single Interval type, where
> >>> the DAY_TIME type actually contains an absolute delta based on the
> >>> indicated unit. It is true that the divmod operation to decompose into
> >>> number of days and intraday units (milliseconds, nanoseconds, etc.) is
> >>> not the cheapest, but I don't know the use cases for the type well
> >>> enough to judge.
> >>>
> >>> On Wed, Nov 8, 2017 at 5:10 PM, Jacques Nadeau <jacq...@apache.org>
> wrote:
> >>>> I'm all for moving interval to the new definition. I think we should
> avoid
> >>>> introducing a timedelta type until it is really important. We need
> several
> >>>> users demanding a type before we should implement it. Otherwise, we
> have
> >>>> huge amounts of type bloat (which means nothing will fully implement
> the
> >>>> spec and be able to interoperate).
> >>>>
> >>>> On Sat, Nov 4, 2017 at 3:46 PM, Julian Hyde <jh...@apache.org> wrote:
> >>>>
> >>>>> As I understand it, the proposal is to have both an interval data
> type[1]
> >>>>> and a timedelta type[2].  The interval is compatible with the SQL
> standard
> >>>>> (but not Postgres) and can be implemented with a single numeric value
> >>>>> representing a particular time unit (year, month, day, hour, minute,
> >>>>> second, and possibly fractional seconds); timedelta is an array of
> numeric
> >>>>> values, one for a set of time units.
> >>>>>
> >>>>> I think we should have both, and operators to convert between them.
> >>>>> Interval is certainly efficient, and is what some applications need,
> but
> >>>>> some applications need timedelta.
> >>>>>
> >>>>> Julian
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/ARROW-352 <
> >>>>> https://issues.apache.org/jira/browse/ARROW-352>
> >>>>>
> >>>>> [2] https://issues.apache.org/jira/browse/ARROW-835 <
> >>>>> https://issues.apache.org/jira/browse/ARROW-835>
> >>>>>
> >>>>> > On Nov 4, 2017, at 1:26 PM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>>>> >
> >>>>> > It seems like we don't have enough input on this topic to make a
> >>>>> > decision right now. I placed the JIRA ARROW-352 in the 0.9.0
> >>>>> > milestone, but we really should try to get this done soon so that
> >>>>> > downstream users are not blocked on using Arrow to send around
> >>>>> > interval data.
> >>>>> >
> >>>>> > - Wes
> >>>>> >
> >>>>> > On Fri, Oct 20, 2017 at 12:34 AM, Li Jin <ice.xell...@gmail.com>
> wrote:
> >>>>> >> +1 on this one.
> >>>>> >>
> >>>>> >> My reason is this makes timestamp/interval calculation faster,
> i.e,
> >>>>> >> "timestamp + interval < timestamp" should be faster without
> dealing with
> >>>>> >> two component in interval. Although I am not quite sure about the
> >>>>> rational
> >>>>> >> behind the two component representation, which seems to be what
> is used
> >>>>> in
> >>>>> >> Spark:
> >>>>> >>
> >>>>> >> https://github.com/apache/spark/blob/master/common/
> >>>>> unsafe/src/main/java/org/apache/spark/unsafe/types/
> CalendarInterval.java
> >>>>> >>
> >>>>> >> I am interested in hearing reasoning behind two component.
> >>>>> >>
> >>>>> >> On Wed, Oct 18, 2017 at 8:32 PM, Wes McKinney <
> wesmck...@gmail.com>
> >>>>> wrote:
> >>>>> >>
> >>>>> >>> I opened this patch over 2 months ago to add some additional
> metadata
> >>>>> >>> for intervals:
> >>>>> >>>
> >>>>> >>> https://github.com/apache/arrow/pull/920
> >>>>> >>>
> >>>>> >>> Java supports a two-component DAY_TIME interval type as a combo
> of
> >>>>> >>> days and milliseconds:
> >>>>> >>>
> >>>>> >>> https://github.com/apache/arrow/blob/
> 402baa4ec391b61dd37c770ae7978d
> >>>>> >>> 51b9b550fa/java/vector/src/main/codegen/data/
> ValueVectorTypes.tdd#L106
> >>>>> >>>
> >>>>> >>> I propose that we change the interval representation to be a
> number of
> >>>>> >>> elapsed units of time from a particular point in time. This unit
> >>>>> >>> choices would be the same as our unit for timestamps, so an
> interval
> >>>>> >>> can be viewed as a delta between two timestamps of some
> resolution
> >>>>> >>> (second through nanoseconds) [1].
> >>>>> >>>
> >>>>> >>> As context, a number of systems I have worked with deal in
> absolute
> >>>>> >>> time deltas. In pandas, for example, the difference of timestamps
> >>>>> >>> (datetime64 values) is a timedelta:
> >>>>> >>>
> >>>>> >>> In [1]: import pandas as pd
> >>>>> >>>
> >>>>> >>> In [2]: dr1 = pd.date_range('1/1/2000', periods=5)
> >>>>> >>>
> >>>>> >>> In [3]: dr2 = pd.date_range('1/2/2000', periods=5)
> >>>>> >>>
> >>>>> >>> In [4]: dr1 - dr2
> >>>>> >>> Out[4]: TimedeltaIndex(['-1 days', '-1 days', '-1 days', '-1
> days',
> >>>>> >>> '-1 days'], dtype='timedelta64[ns]', freq=None)
> >>>>> >>>
> >>>>> >>> In [5]: (dr1 - dr2).values
> >>>>> >>> Out[5]:
> >>>>> >>> array([-86400000000000, -86400000000000, -86400000000000,
> >>>>> -86400000000000,
> >>>>> >>>       -86400000000000], dtype='timedelta64[ns]')
> >>>>> >>>
> >>>>> >>> We need to be able to represent this data coherently (up to
> nanosecond
> >>>>> >>> resolution) with the Arrow metadata, and we will also at some
> point
> >>>>> >>> need to perform analytics directly on this data type.
> >>>>> >>>
> >>>>> >>> An alternative proposal to changing the DAY_TIME interval
> >>>>> >>> representation is to add another kind of interval type, so
> instead of
> >>>>> >>> only YEAR_MONTH and DAY_TIME, we have TIMEDELTA. The downside of
> this,
> >>>>> >>> of course, is the extra implementation complexity. DAY_TIME with
> the
> >>>>> >>> current Java representation also seems to me to be a subset of
> what
> >>>>> >>> you can represent with TIMEDELTA.
> >>>>> >>>
> >>>>> >>> It would be great to make a decision about this so we can get
> this
> >>>>> >>> metadata finalized in the 0.8.0 release.
> >>>>> >>>
> >>>>> >>> Thanks
> >>>>> >>> Wes
> >>>>> >>>
> >>>>> >>> [1]: https://github.com/apache/arrow/blob/master/format/
> >>>>> Schema.fbs#L135
> >>>>> >>>
> >>>>>
> >>>>>
>

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

Reply via email to