Hi Wes,

Sounds good to me.

Br,

Zoltan

On Thu, Jul 11, 2019 at 4:14 PM Wes McKinney <[email protected]> wrote:
>
> On Thu, Jul 11, 2019 at 8:17 AM Zoltan Ivanfi <[email protected]> wrote:
> >
> > Hi Wes,
> >
> > I did a little bit of testing using pyarrow 0.14.0. I know that this
> > is not the latest state of the code, but my understanding is that the
> > only planned change is that 0.14.1 will add the legacy types even for
> > local semantics. Here is what I did:
> >
> > In [1]: import pandas as pd
> >    ...: from datetime import datetime
> >    ...: from pandas import Timestamp
> >    ...: from pytz import timezone
> >    ...: ts_dt = datetime(1970, 1, 1)
> >    ...: ts_pd = Timestamp(year=1970, month=1, day=1)
> >    ...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1,
> >    ...:                         tz=timezone("Europe/Paris"))
> >    ...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1, hour=2,
> >    ...:                            tz=timezone("Europe/Helsinki"))
> >    ...: df = pd.DataFrame({
> >    ...:     'datetime': [ts_dt, None],
> >    ...:     'pd_no_tz': [ts_pd, None],
> >    ...:     'pd_paris': [ts_pd_paris, None],
> >    ...:     'pd_helsinki': [ts_pd_helsinki, None],
> >    ...:     'pd_mixed': [ts_pd_paris, ts_pd_helsinki]
> >    ...: })
> >    ...: df.to_parquet('test.parquet')
> >    ...: df
> > Out[1]:
> >     datetime   pd_no_tz                  pd_paris               pd_helsinki                  pd_mixed
> > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00+01:00
> > 1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00+02:00
> >
> > I picked these values because I expected all of these timestamps to be
> > saved as the numeric value 0.
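As a quick aside, the expectation that all of these timestamps denote the instant 0 can be verified directly in pandas, since `Timestamp.value` is nanoseconds since the Unix epoch (UTC-based for tz-aware values). This sketch uses the same constructor arguments as above, with string time zone names instead of pytz objects:

```python
import pandas as pd

# Naive timestamp: wall clock 1970-01-01 00:00, epoch value 0 by definition.
ts_pd = pd.Timestamp(year=1970, month=1, day=1)

# 01:00 in Paris (+01:00 in January 1970) and 02:00 in Helsinki (+02:00)
# are both the instant 1970-01-01 00:00:00 UTC.
ts_paris = pd.Timestamp(year=1970, month=1, day=1, hour=1,
                        tz="Europe/Paris")
ts_helsinki = pd.Timestamp(year=1970, month=1, day=1, hour=2,
                           tz="Europe/Helsinki")

# .value is nanoseconds since the epoch (UTC for tz-aware timestamps)
print(ts_pd.value, ts_paris.value, ts_helsinki.value)  # 0 0 0
```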
> > Checking the metadata using parquet-tools:
> >
> > > parquet-tools meta test.parquet
> > datetime:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> > pd_no_tz:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> > pd_paris:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> > pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> > pd_mixed:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> >
> > This matched my expectations up until pd_mixed. I was surprised to see
> > that timestamps with mixed time zones were stored using local
> > semantics instead of being normalized to UTC, but if it's an
> > intentional choice, I'm fine with it as long as the numbers are
> > correct:
> >
> > > parquet-tools head test.parquet
> > datetime = 0
> > pd_no_tz = 0
> > pd_paris = 0
> > pd_helsinki = 0
> > pd_mixed = 3600000
> > pd_mixed = 7200000
> >
> > The numbers for the mixed column are not 0, but that is just the
> > result of not using UTC-normalized semantics there. The values can be
> > correctly interpreted by parquet-tools:
> >
> > > parquet-tools dump test.parquet
> > INT64 datetime
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_no_tz
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_paris
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_helsinki
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_mixed
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T01:00:00.000
> > value 2: R:0 D:1 V:1970-01-01T02:00:00.000
> >
> > And naturally by pandas as well:
> >
> > In [2]: df2 = pd.read_parquet('test.parquet')
> >    ...: df2
> > Out[2]:
> >     datetime   pd_no_tz                  pd_paris               pd_helsinki            pd_mixed
> > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00
> > 1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00
> >
> > Finally, let's try an older version of parquet-tools as well:
> >
> > > parquet-tools meta test.parquet
> > datetime:    OPTIONAL INT64 R:0 D:1
> > pd_no_tz:    OPTIONAL INT64 R:0 D:1
> > pd_paris:    OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
> > pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
> > pd_mixed:    OPTIONAL INT64 R:0 D:1
>
> Just to confirm that in pyarrow 0.13.0 (which tracks
> parquet-cpp-in-development 1.6.0, prior to the introduction of
> LogicalType), TIMESTAMP_* is set for all 5 columns. This is the
> forward compatibility that we are planning to restore.
>
> > This confirms that the legacy types are only written for
> > UTC-normalized timestamps in 0.14.0.
> >
> > In summary, when saving two timestamps from different time zones that
> > refer to the same instant, I would have expected pyarrow to normalize
> > to UTC and thereby sacrifice the local representations instead of
> > saving using local semantics and thereby sacrificing the instants. I
> > don't know whether that is the intended behaviour, but in any case,
> > based on this short manual testing, the new timestamp types written by
> > pyarrow are interoperable with the Java library.
> >
> > Br,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 4:30 PM Wes McKinney <[email protected]> wrote:
> > >
> > > Correct
> > >
> > > On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi <[email protected]> wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > Do you mean that the new logical types have already been released in
> > > > 0.14.0 and a 0.14.1 is needed ASAP to fix this regression?
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <[email protected]> wrote:
> > > > >
> > > > > hi Zoltan -- given the raging fire that is 0.14.0 as a result of
> > > > > these issues and others, we need to make a new release within the
> > > > > next 7-10 days. We can point you to nightly Python builds to make
> > > > > testing easier for you so you don't have to build the project
> > > > > yourself.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <[email protected]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Oh, and one more thing: before releasing the next Arrow version
> > > > > > incorporating the new logical types, we should definitely test
> > > > > > that their behaviour matches that of parquet-mr. When is the next
> > > > > > release planned to come out?
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Wes,
> > > > > > >
> > > > > > > Yes, I agree that we should do that, but then we have the
> > > > > > > problem of what to do in the other direction, i.e. when we use
> > > > > > > the new logical types API to read a TIMESTAMP_MILLIS or
> > > > > > > TIMESTAMP_MICROS, how should we set the UTC-normalized flag?
> > > > > > > Tim has started a discussion about that, suggesting to use
> > > > > > > three states, which I have just answered.
> > > > > > >
> > > > > > > Br,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Thanks for the comments.
> > > > > > > >
> > > > > > > > So in summary I think that we need to set the TIMESTAMP_*
> > > > > > > > converted types to maintain forward compatibility and stay
> > > > > > > > consistent with what we were doing in the C++ library prior
> > > > > > > > to the introduction of the LogicalType metadata.
> > > > > > > >
> > > > > > > > On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi Wes,
> > > > > > > > >
> > > > > > > > > Both of the semantics are deterministic in one aspect and
> > > > > > > > > indeterministic in another. Timestamps of instant semantics
> > > > > > > > > will always refer to the same instant, but their user-facing
> > > > > > > > > representation (how they get displayed) depends on the
> > > > > > > > > user's time zone. Timestamps of local semantics always have
> > > > > > > > > the same user-facing representation, but the instant they
> > > > > > > > > refer to is undefined (or ambiguous, depending on your point
> > > > > > > > > of view).
> > > > > > > > >
> > > > > > > > > My understanding is that Spark uses instant semantics, i.e.,
> > > > > > > > > timestamps are stored normalized to UTC and are displayed
> > > > > > > > > adjusted to the user's local time zone.
> > > > > > > > >
> > > > > > > > > Br,
> > > > > > > > >
> > > > > > > > > Zoltan
> > > > > > > > >
> > > > > > > > > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks Zoltan.
> > > > > > > > > >
> > > > > > > > > > This is definitely a tricky issue.
> > > > > > > > > >
> > > > > > > > > > Spark's application of localtime semantics to timestamp
> > > > > > > > > > data has been a source of issues for many people.
> > > > > > > > > > Personally I don't find that behavior to be particularly
> > > > > > > > > > helpful, since depending on the session time zone, you
> > > > > > > > > > will get different results on data not marked as
> > > > > > > > > > UTC-normalized.
> > > > > > > > > >
> > > > > > > > > > In pandas, in contrast, we apply UTC semantics to
> > > > > > > > > > naive/not-explicitly-normalized data, so at least the code
> > > > > > > > > > produces deterministic results in all environments. That
> > > > > > > > > > seems strictly better to me -- if you want a localized
> > > > > > > > > > interpretation of naive data, you must opt into this
> > > > > > > > > > rather than having it automatically selected based on your
> > > > > > > > > > locale. The instances of people shooting their toes off
> > > > > > > > > > due to time zones are practically non-existent, whereas
> > > > > > > > > > I'm hearing about Spark gotchas all the time.
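The explicit opt-in described above can be illustrated with a short pandas sketch (illustrative values only): naive data carries no instant and looks the same in every environment, and attaching a zone is a deliberate step rather than a locale-dependent default.

```python
import pandas as pd

# Naive data: no instant attached, same wall-clock value everywhere.
naive = pd.Series(pd.to_datetime(["1970-01-01 01:00:00"]))

# A localized interpretation is an explicit opt-in:
paris = naive.dt.tz_localize("Europe/Paris")  # declare the wall clock to be Paris time
as_utc = paris.dt.tz_convert("UTC")           # now an unambiguous instant

print(as_utc.iloc[0])  # 1970-01-01 00:00:00+00:00
```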
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Wes,
> > > > > > > > > > >
> > > > > > > > > > > The rules for TIMESTAMP forward compatibility were
> > > > > > > > > > > created based on the assumption that TIMESTAMP_MILLIS
> > > > > > > > > > > and TIMESTAMP_MICROS have only been used in the instant
> > > > > > > > > > > (aka UTC-normalized) semantics so far. This assumption
> > > > > > > > > > > was supported by two sources:
> > > > > > > > > > >
> > > > > > > > > > > 1. The specification: parquet-format defined
> > > > > > > > > > > TIMESTAMP_MILLIS and TIMESTAMP_MICROS as the number of
> > > > > > > > > > > milli/microseconds elapsed since the Unix epoch, an
> > > > > > > > > > > instant specified in UTC, from which it follows that
> > > > > > > > > > > they have instant semantics (because timestamps of local
> > > > > > > > > > > semantics do not correspond to a single instant).
> > > > > > > > > > >
> > > > > > > > > > > 2. Anecdotal knowledge: we were not aware of any
> > > > > > > > > > > software component that used these types differently
> > > > > > > > > > > from the specification.
> > > > > > > > > > >
> > > > > > > > > > > Based on your e-mail, we were wrong on #2.
> > > > > > > > > > >
> > > > > > > > > > > From this false premise it followed that TIMESTAMPs with
> > > > > > > > > > > local semantics were a new type and did not need to be
> > > > > > > > > > > annotated with the old types to maintain compatibility.
> > > > > > > > > > > In fact, annotating them with the old types was
> > > > > > > > > > > considered to be harmful, since it would have misled
> > > > > > > > > > > older readers into thinking that they can read
> > > > > > > > > > > TIMESTAMPs with local semantics, when in reality they
> > > > > > > > > > > would have misinterpreted them as TIMESTAMPs with
> > > > > > > > > > > instant semantics. This would have led to a difference
> > > > > > > > > > > of several hours, corresponding to the time zone offset.
> > > > > > > > > > >
> > > > > > > > > > > In the light of your e-mail, this misinterpretation of
> > > > > > > > > > > timestamps may already be happening, since if Arrow
> > > > > > > > > > > annotates local timestamps with TIMESTAMP_MILLIS or
> > > > > > > > > > > TIMESTAMP_MICROS, Spark probably misinterprets them as
> > > > > > > > > > > timestamps with instant semantics, leading to a
> > > > > > > > > > > difference of several hours.
> > > > > > > > > > >
> > > > > > > > > > > Based on this, I think it would make sense from Arrow's
> > > > > > > > > > > point of view to annotate both semantics with the old
> > > > > > > > > > > types, since that is its historical behaviour and
> > > > > > > > > > > keeping it up is needed for maintaining compatibility.
> > > > > > > > > > > I'm not so sure about the Java library though, since as
> > > > > > > > > > > far as I know, these types were never used in the local
> > > > > > > > > > > sense there (although I may be wrong again).
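The several-hour shift described above is easy to reproduce numerically. This is an illustrative sketch with made-up values, not actual Spark code: a writer stores the local wall clock 1970-01-01 01:00 as 3,600,000 ms with no UTC flag, and a reader that treats TIMESTAMP_MILLIS as UTC-normalized renders it in a Europe/Paris session zone.

```python
import pandas as pd

millis = 3_600_000  # writer meant local wall clock 1970-01-01 01:00:00

# An instant-semantics reader treats the count as UTC-based...
as_instant = pd.Timestamp(millis * 1_000_000, tz="UTC")  # integer input is nanoseconds
# ...and displays it in the session time zone:
shown = as_instant.tz_convert("Europe/Paris")

print(shown)  # 1970-01-01 02:00:00+01:00 -- off by the zone offset
```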
> > > > > > > > > > >
> > > > > > > > > > > Were we to decide that Arrow and parquet-mr should
> > > > > > > > > > > behave differently in this aspect though, it may be
> > > > > > > > > > > tricky to convey this distinction in the specification.
> > > > > > > > > > > I would be interested in hearing your and other
> > > > > > > > > > > developers' opinions on this.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Zoltan
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > hi folks,
> > > > > > > > > > > >
> > > > > > > > > > > > We have just recently implemented the new LogicalType
> > > > > > > > > > > > unions in the Parquet C++ library and we have run into
> > > > > > > > > > > > a forward compatibility problem with reader versions
> > > > > > > > > > > > prior to this implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > To recap the issue, prior to the introduction of
> > > > > > > > > > > > LogicalType, the Parquet format had no explicit notion
> > > > > > > > > > > > of time zones or UTC normalization. The new
> > > > > > > > > > > > TimestampType provides a flag to indicate
> > > > > > > > > > > > UTC normalization:
> > > > > > > > > > > >
> > > > > > > > > > > > struct TimestampType {
> > > > > > > > > > > >   1: required bool isAdjustedToUTC
> > > > > > > > > > > >   2: required TimeUnit unit
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > When using this new type, the ConvertedType field must
> > > > > > > > > > > > also be set for forward compatibility (so that old
> > > > > > > > > > > > readers can still understand the data), but
> > > > > > > > > > > > parquet.thrift says
> > > > > > > > > > > >
> > > > > > > > > > > > // use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC = true, unit = MICROS)
> > > > > > > > > > > > // use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC = true, unit = MILLIS)
> > > > > > > > > > > > 8: TimestampType TIMESTAMP
> > > > > > > > > > > >
> > > > > > > > > > > > In Apache Arrow, we have 2 varieties of timestamps:
> > > > > > > > > > > >
> > > > > > > > > > > > * Timestamp without time zone (no UTC normalization indicated)
> > > > > > > > > > > > * Timestamp with time zone (values UTC-normalized)
> > > > > > > > > > > >
> > > > > > > > > > > > Prior to the introduction of LogicalType, we would set
> > > > > > > > > > > > either TIMESTAMP_MILLIS or TIMESTAMP_MICROS
> > > > > > > > > > > > unconditionally, regardless of UTC normalization. So
> > > > > > > > > > > > when reading the data back, any notion of having had a
> > > > > > > > > > > > time zone is lost (it could be stored in schema
> > > > > > > > > > > > metadata if desired).
> > > > > > > > > > > >
> > > > > > > > > > > > I believe that setting the TIMESTAMP_* ConvertedType
> > > > > > > > > > > > _only_ when isAdjustedToUTC is true creates a forward
> > > > > > > > > > > > compatibility break in this regard. This was reported
> > > > > > > > > > > > to us shortly after releasing Apache Arrow 0.14.0:
> > > > > > > > > > > >
> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-5878
> > > > > > > > > > > >
> > > > > > > > > > > > We are discussing setting the ConvertedType
> > > > > > > > > > > > unconditionally in
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/arrow/pull/4825
> > > > > > > > > > > >
> > > > > > > > > > > > This might need to be a setting that is toggled when
> > > > > > > > > > > > data is coming from Arrow, but I wonder if the text in
> > > > > > > > > > > > parquet.thrift is the intended forward compatibility
> > > > > > > > > > > > interpretation, and if not, whether we should amend
> > > > > > > > > > > > it.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Wes
