Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Zoltan Ivanfi Wed, 10 Jul 2019 07:22:11 -0700

Hi Wes,

Do you mean that the new logical types have already been released in 0.14.0
and a 0.14.1 is needed ASAP to fix this regression?


Thanks,

Zoltan

On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> issues and others we need to make a new release within the next 7-10
> days. We can point you to nightly Python builds to make testing for
> you easier so you don't have to build the project yourself.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <z...@cloudera.com.invalid>
> wrote:
> >
> > Hi,
> >
> > Oh, and one more thing: Before releasing the next Arrow version
> > incorporating the new logical types, we should definitely test that their
> > behaviour matches that of parquet-mr. When is the next release planned to
> > come out?
> >
> > Br,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <z...@cloudera.com> wrote:
> >
> > > Hi Wes,
> > >
> > > Yes, I agree that we should do that, but then we have a problem of
> what to
> > > do in the other direction, i.e. when we use the new logical types API
> to
> > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> > > normalized flag? Tim has started a discussion about that, suggesting
> to use
> > > three states that I just answered.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > >
> > >> Thank for the comments.
> > >>
> > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > >> types to maintain forward compatibility and stay consistent with what
> > >> we were doing in the C++ library prior to the introduction of the
> > >> LogicalType metadata.
> > >>
> > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <z...@cloudera.com.invalid
> >
> > >> wrote:
> > >> >
> > >> > Hi Wes,
> > >> >
> > >> > Both of the semantics are deterministic in one aspect and
> > >> indeterministic
> > >> > in another. Timestamps of instant semantic will always refer to the
> same
> > >> > instant, but their user-facing representation (how they get
> displayed)
> > >> > depends on the user's time zone. Timestamps of local semantics
> always
> > >> have
> > >> > the same user-facing representation but the instant they refer to is
> > >> > undefined (or ambigous, depending on your point of view).
> > >> >
> > >> > My understanding is that Spark uses instant semantics, i.e.,
> timestamps
> > >> are
> > >> > stored normalized to UTC and are displayed adjusted to the user's
> local
> > >> > time zone.
> > >> >
> > >> > Br,
> > >> >
> > >> > Zoltan
> > >> >
> > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <wesmck...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > Thanks Zoltan.
> > >> > >
> > >> > > This is definitely a tricky issue.
> > >> > >
> > >> > > Spark's application of localtime semantics to timestamp data has
> been
> > >> > > a source of issues for many people. Personally I don't find that
> > >> > > behavior to be particularly helpful since depending on the session
> > >> > > time zone, you will get different results on data not marked as
> > >> > > UTC-normalized.
> > >> > >
> > >> > > In pandas, as contrast, we apply UTC semantics to
> > >> > > naive/not-explicitly-normalized data so at least the code produces
> > >> > > deterministic results on all environments. That seems strictly
> better
> > >> > > to me -- if you want a localized interpretation of naive data, you
> > >> > > must opt into this rather than having it automatically selected
> based
> > >> > > on your locale. The instances of people shooting their toes off
> due to
> > >> > > time zones are practically non-existent, whereas I'm hearing about
> > >> > > Spark gotchas all the time.
> > >> > >
> > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
> <z...@cloudera.com.invalid
> > >> >
> > >> > > wrote:
> > >> > > >
> > >> > > > Hi Wes,
> > >> > > >
> > >> > > > The rules for TIMESTAMP forward-compatibility were created
> based on
> > >> the
> > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
> > >> been used
> > >> > > > in the instant aka. UTC-normalized semantics so far. This
> > >> assumption was
> > >> > > > supported by two sources:
> > >> > > >
> > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> and
> > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed
> since
> > >> the
> > >> > > Unix
> > >> > > > epoch, an instant specified in UTC, from which it follows that
> they
> > >> have
> > >> > > > instant semantics (because timestamps of local semantics do not
> > >> > > correspond
> > >> > > > to a single instant).
> > >> > > >
> > >> > > > 2. Anecdotal knowledge: We were not aware of any software
> component
> > >> that
> > >> > > > used these types differently from the specification.
> > >> > > >
> > >> > > > Based on your e-mail, we were wrong on #2.
> > >> > > >
> > >> > > > From this false premise it followed that TIMESTAMPs with local
> > >> semantics
> > >> > > > were a new type and did not need to be annotated with the old
> types
> > >> to
> > >> > > > maintain compatibility. In fact, annotating them with the old
> types
> > >> were
> > >> > > > considered to be harmful, since it would have mislead older
> readers
> > >> into
> > >> > > > thinking that they can read TIMESTAMPs with local semantics,
> when in
> > >> > > > reality they would have misinterpreted them as TIMESTAMPs with
> > >> instant
> > >> > > > semantics. This would have lead to a difference of several
> hours,
> > >> > > > corresponding to the time zone offset.
> > >> > > >
> > >> > > > In the light of your e-mail, this misinterpretation of
> timestamps
> > >> may
> > >> > > > already be happening, since if Arrow annotates local timestamps
> with
> > >> > > > TIMESTAMP_MILLIS or TIMESTMAP_MICROS, Spark probably
> misinterprets
> > >> them
> > >> > > as
> > >> > > > timestamps with instant semantics, leading to a difference of
> > >> several
> > >> > > hours.
> > >> > > >
> > >> > > > Based on this, I think it would make sense from Arrow's point of
> > >> view to
> > >> > > > annotate both semantics with the old types, since that is its
> > >> historical
> > >> > > > behaviour and keeping it up is needed for maintaining
> compatibilty.
> > >> I'm
> > >> > > not
> > >> > > > so sure about the Java library though, since as far as I know,
> these
> > >> > > types
> > >> > > > were never used in the local sense there (although I may be
> wrong
> > >> again).
> > >> > > > Were we to decide that Arrow and parquet-mr should behave
> > >> differently in
> > >> > > > this aspect though, it may be tricky to convey this distinction
> in
> > >> the
> > >> > > > specification. I would be interested in hearing your and other
> > >> > > developers'
> > >> > > > opinions on this.
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Zoltan
> > >> > > >
> > >> > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <
> wesmck...@gmail.com>
> > >> wrote:
> > >> > > >
> > >> > > > > hi folks,
> > >> > > > >
> > >> > > > > We have just recently implemented the new LogicalType unions
> in
> > >> the
> > >> > > > > Parquet C++ library and we have run into a forward
> compatibility
> > >> > > > > problem with reader versions prior to this implementation.
> > >> > > > >
> > >> > > > > To recap the issue, prior to the introduction of LogicalType,
> the
> > >> > > > > Parquet format had no explicit notion of time zones or UTC
> > >> > > > > normalization. The new TimestampType provides a flag to
> indicate
> > >> > > > > UTC-normalization
> > >> > > > >
> > >> > > > > struct TimestampType {
> > >> > > > > 1: required bool isAdjustedToUTC
> > >> > > > > 2: required TimeUnit unit
> > >> > > > > }
> > >> > > > >
> > >> > > > > When using this new type, the ConvertedType field must also be
> > >> set for
> > >> > > > > forward compatibility (so that old readers can still
> understand
> > >> the
> > >> > > > > data), but parquet.thrift says
> > >> > > > >
> > >> > > > > // use ConvertedType TIMESTAMP_MICROS for
> > >> TIMESTAMP(isAdjustedToUTC =
> > >> > > > > true, unit = MICROS)
> > >> > > > > // use ConvertedType TIMESTAMP_MILLIS for
> > >> TIMESTAMP(isAdjustedToUTC =
> > >> > > > > true, unit = MILLIS)
> > >> > > > > 8: TimestampType TIMESTAMP
> > >> > > > >
> > >> > > > > In Apache Arrow, we have 2 varieties of timestamps:
> > >> > > > >
> > >> > > > > * Timestamp without time zone (no UTC normalization indicated)
> > >> > > > > * Timestamp with time zone (values UTC-normalized)
> > >> > > > >
> > >> > > > > Prior to the introduction of LogicalType, we would set either
> > >> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC
> > >> > > > > normalization. So when reading the data back, any notion of
> > >> having had
> > >> > > > > a time zone is lost (it could be stored in schema metadata if
> > >> > > > > desired).
> > >> > > > >
> > >> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_
> when
> > >> > > > > isAdjustedToUTC is true creates a forward compatibility break
> in
> > >> this
> > >> > > > > regard. This was reported to us shortly after releasing Apache
> > >> Arrow
> > >> > > > > 0.14.0:
> > >> > > > >
> > >> > > > > https://issues.apache.org/jira/browse/ARROW-5878
> > >> > > > >
> > >> > > > > We are discussing setting the ConvertedType unconditionally in
> > >> > > > >
> > >> > > > > https://github.com/apache/arrow/pull/4825
> > >> > > > >
> > >> > > > > This might need to be a setting that is toggled when data is
> > >> coming
> > >> > > > > from Arrow, but I wonder if the text in parquet.thrift is the
> > >> intended
> > >> > > > > forward compatibility interpretation, and if not should we
> amend.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Wes
> > >> > > > >
> > >> > >
> > >>
> > >
>

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Reply via email to