Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Wes McKinney Thu, 11 Jul 2019 07:15:31 -0700

On Thu, Jul 11, 2019 at 8:17 AM Zoltan Ivanfi <[email protected]> wrote:
>
> Hi Wes,
>
> I did a little bit of testing using pyarrow 0.14.0. I know that this
> is not the latest state of the code, but my understanding is that the
> only planned change is that 0.14.1 will add the legacy types even for
> local semantics. Here is what I did:
>
>     In [1]: import pandas as pd
>        ...: from datetime import datetime
>        ...: from pandas import Timestamp
>        ...: from pytz import timezone
>        ...: ts_dt = datetime(1970, 1, 1)
>        ...: ts_pd = Timestamp(year=1970, month=1, day=1)
>        ...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1,
> tz=timezone("Europe/Paris"))
>        ...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1,
> hour=2, tz=timezone("Europe/Helsinki"))
>        ...: df = pd.DataFrame({
>        ...:     'datetime': [ts_dt, None],
>        ...:     'pd_no_tz': [ts_pd, None],
>        ...:     'pd_paris': [ts_pd_paris, None],
>        ...:     'pd_helsinki': [ts_pd_helsinki, None],
>        ...:     'pd_mixed': [ts_pd_paris, ts_pd_helsinki]
>        ...: })
>        ...: df.to_parquet('test.parquet')
>        ...: df
>     Out[1]:
>         datetime   pd_no_tz                  pd_paris
> pd_helsinki                   pd_mixed
>     0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01
> 02:00:00+02:00  1970-01-01 01:00:00+01:00
>     1        NaT        NaT                       NaT
>      NaT  1970-01-01 02:00:00+02:00
>
> I picked these values because I expected all of these timestamps to be
> saved as the numeric value 0. Checking the metadata using
> parquet-tools:
>
>     > parquet-tools meta test.parquet
>     datetime:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
>     pd_no_tz:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
>     pd_paris:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
>     pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
>     pd_mixed:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
>
> This matched my expectations up until pd_mixed. I was surprised to see
> that timestamps with mixed time zones were be stored using local
> semantics instead of being normalized to UTC, but if it's an
> intentional choice, I'm fine with it as long as the numbers are
> correct:
>
>     > parquet-tools head test.parquet
>     datetime = 0
>     pd_no_tz = 0
>     pd_paris = 0
>     pd_helsinki = 0
>     pd_mixed = 3600000
>     pd_mixed = 7200000
>
> The numbers for the mixed column are not 0, but that is just the
> result of not using UTC-normalized semantics there. The values can be
> correctly interpreted by parquet-tools:
>
>     > parquet-tools dump test.parquet
>     INT64 datetime
>     
> --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_no_tz
>     
> --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_paris
>     
> --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_helsinki
>     
> --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_mixed
>     
> --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T01:00:00.000
>     value 2: R:0 D:1 V:1970-01-01T02:00:00.000
>
> And naturally by pandas as well:
>
>     In [2]: df2 = pd.read_parquet('test.parquet')
>        ...: df2
>     Out[2]:
>         datetime   pd_no_tz                  pd_paris
> pd_helsinki            pd_mixed
>     0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01
> 02:00:00+02:00 1970-01-01 01:00:00
>     1        NaT        NaT                       NaT
>      NaT 1970-01-01 02:00:00
>
> Finally, let's try an older version of parquet-tools as well:
>
>     > parquet-tools meta test.parquet
>     datetime:    OPTIONAL INT64 R:0 D:1
>     pd_no_tz:    OPTIONAL INT64 R:0 D:1
>     pd_paris:    OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
>     pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
>     pd_mixed:    OPTIONAL INT64 R:0 D:1
>


Just to confirm that in pyarrow 0.13.0 (which tracks
parquet-cpp-in-development 1.6.0, prior to introduction of
LogicalType), TIMESTAMP_* is set for all 5 columns. This is the
forward compatibility that we are planning to restore.

> This confirms that the legacy types are only written for
> UTC-normalized timestamps in 0.14.0.
>
> In summary, when saving two timestamps from different time zones that
> refer to the same instant, I would have expected pyarrow to normalize
> to UTC and thereby sacrifice the local representations instead of
> saving using local semantics and thereby sacrificing the instants. I
> don't know whether that is the intended behaviour, but in any case,
> based on this short manual testing, the new timestamp types written by
> pyarrow are interopable with the Java library.
>
> Br,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 4:30 PM Wes McKinney <[email protected]> wrote:
> >
> > Correct
> >
> > On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi <[email protected]> 
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > Do you mean that the new logical types have already been released in 
> > > 0.14.0
> > > and a 0.14.1 is needed ASAP to fix this regression?
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <[email protected]> wrote:
> > >
> > > > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> > > > issues and others we need to make a new release within the next 7-10
> > > > days. We can point you to nightly Python builds to make testing for
> > > > you easier so you don't have to build the project yourself.
> > > >
> > > > - Wes
> > > >
> > > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi 
> > > > <[email protected]>
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Oh, and one more thing: Before releasing the next Arrow version
> > > > > incorporating the new logical types, we should definitely test that 
> > > > > their
> > > > > behaviour matches that of parquet-mr. When is the next release 
> > > > > planned to
> > > > > come out?
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <[email protected]> 
> > > > > wrote:
> > > > >
> > > > > > Hi Wes,
> > > > > >
> > > > > > Yes, I agree that we should do that, but then we have a problem of
> > > > what to
> > > > > > do in the other direction, i.e. when we use the new logical types 
> > > > > > API
> > > > to
> > > > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the 
> > > > > > UTC
> > > > > > normalized flag? Tim has started a discussion about that, suggesting
> > > > to use
> > > > > > three states that I just answered.
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <[email protected]>
> > > > wrote:
> > > > > >
> > > > > >> Thank for the comments.
> > > > > >>
> > > > > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > > > > >> types to maintain forward compatibility and stay consistent with 
> > > > > >> what
> > > > > >> we were doing in the C++ library prior to the introduction of the
> > > > > >> LogicalType metadata.
> > > > > >>
> > > > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi 
> > > > > >> <[email protected]
> > > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > Hi Wes,
> > > > > >> >
> > > > > >> > Both of the semantics are deterministic in one aspect and
> > > > > >> indeterministic
> > > > > >> > in another. Timestamps of instant semantic will always refer to 
> > > > > >> > the
> > > > same
> > > > > >> > instant, but their user-facing representation (how they get
> > > > displayed)
> > > > > >> > depends on the user's time zone. Timestamps of local semantics
> > > > always
> > > > > >> have
> > > > > >> > the same user-facing representation but the instant they refer 
> > > > > >> > to is
> > > > > >> > undefined (or ambigous, depending on your point of view).
> > > > > >> >
> > > > > >> > My understanding is that Spark uses instant semantics, i.e.,
> > > > timestamps
> > > > > >> are
> > > > > >> > stored normalized to UTC and are displayed adjusted to the user's
> > > > local
> > > > > >> > time zone.
> > > > > >> >
> > > > > >> > Br,
> > > > > >> >
> > > > > >> > Zoltan
> > > > > >> >
> > > > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <[email protected]>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Thanks Zoltan.
> > > > > >> > >
> > > > > >> > > This is definitely a tricky issue.
> > > > > >> > >
> > > > > >> > > Spark's application of localtime semantics to timestamp data 
> > > > > >> > > has
> > > > been
> > > > > >> > > a source of issues for many people. Personally I don't find 
> > > > > >> > > that
> > > > > >> > > behavior to be particularly helpful since depending on the 
> > > > > >> > > session
> > > > > >> > > time zone, you will get different results on data not marked as
> > > > > >> > > UTC-normalized.
> > > > > >> > >
> > > > > >> > > In pandas, as contrast, we apply UTC semantics to
> > > > > >> > > naive/not-explicitly-normalized data so at least the code 
> > > > > >> > > produces
> > > > > >> > > deterministic results on all environments. That seems strictly
> > > > better
> > > > > >> > > to me -- if you want a localized interpretation of naive data, 
> > > > > >> > > you
> > > > > >> > > must opt into this rather than having it automatically selected
> > > > based
> > > > > >> > > on your locale. The instances of people shooting their toes off
> > > > due to
> > > > > >> > > time zones are practically non-existent, whereas I'm hearing 
> > > > > >> > > about
> > > > > >> > > Spark gotchas all the time.
> > > > > >> > >
> > > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
> > > > <[email protected]
> > > > > >> >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > Hi Wes,
> > > > > >> > > >
> > > > > >> > > > The rules for TIMESTAMP forward-compatibility were created
> > > > based on
> > > > > >> the
> > > > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have 
> > > > > >> > > > only
> > > > > >> been used
> > > > > >> > > > in the instant aka. UTC-normalized semantics so far. This
> > > > > >> assumption was
> > > > > >> > > > supported by two sources:
> > > > > >> > > >
> > > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> > > > and
> > > > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed
> > > > since
> > > > > >> the
> > > > > >> > > Unix
> > > > > >> > > > epoch, an instant specified in UTC, from which it follows 
> > > > > >> > > > that
> > > > they
> > > > > >> have
> > > > > >> > > > instant semantics (because timestamps of local semantics do 
> > > > > >> > > > not
> > > > > >> > > correspond
> > > > > >> > > > to a single instant).
> > > > > >> > > >
> > > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software
> > > > component
> > > > > >> that
> > > > > >> > > > used these types differently from the specification.
> > > > > >> > > >
> > > > > >> > > > Based on your e-mail, we were wrong on #2.
> > > > > >> > > >
> > > > > >> > > > From this false premise it followed that TIMESTAMPs with 
> > > > > >> > > > local
> > > > > >> semantics
> > > > > >> > > > were a new type and did not need to be annotated with the old
> > > > types
> > > > > >> to
> > > > > >> > > > maintain compatibility. In fact, annotating them with the old
> > > > types
> > > > > >> were
> > > > > >> > > > considered to be harmful, since it would have mislead older
> > > > readers
> > > > > >> into
> > > > > >> > > > thinking that they can read TIMESTAMPs with local semantics,
> > > > when in
> > > > > >> > > > reality they would have misinterpreted them as TIMESTAMPs 
> > > > > >> > > > with
> > > > > >> instant
> > > > > >> > > > semantics. This would have lead to a difference of several
> > > > hours,
> > > > > >> > > > corresponding to the time zone offset.
> > > > > >> > > >
> > > > > >> > > > In the light of your e-mail, this misinterpretation of
> > > > timestamps
> > > > > >> may
> > > > > >> > > > already be happening, since if Arrow annotates local 
> > > > > >> > > > timestamps
> > > > with
> > > > > >> > > > TIMESTAMP_MILLIS or TIMESTMAP_MICROS, Spark probably
> > > > misinterprets
> > > > > >> them
> > > > > >> > > as
> > > > > >> > > > timestamps with instant semantics, leading to a difference of
> > > > > >> several
> > > > > >> > > hours.
> > > > > >> > > >
> > > > > >> > > > Based on this, I think it would make sense from Arrow's 
> > > > > >> > > > point of
> > > > > >> view to
> > > > > >> > > > annotate both semantics with the old types, since that is its
> > > > > >> historical
> > > > > >> > > > behaviour and keeping it up is needed for maintaining
> > > > compatibilty.
> > > > > >> I'm
> > > > > >> > > not
> > > > > >> > > > so sure about the Java library though, since as far as I 
> > > > > >> > > > know,
> > > > these
> > > > > >> > > types
> > > > > >> > > > were never used in the local sense there (although I may be
> > > > wrong
> > > > > >> again).
> > > > > >> > > > Were we to decide that Arrow and parquet-mr should behave
> > > > > >> differently in
> > > > > >> > > > this aspect though, it may be tricky to convey this 
> > > > > >> > > > distinction
> > > > in
> > > > > >> the
> > > > > >> > > > specification. I would be interested in hearing your and 
> > > > > >> > > > other
> > > > > >> > > developers'
> > > > > >> > > > opinions on this.
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > >
> > > > > >> > > > Zoltan
> > > > > >> > > >
> > > > > >> > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <
> > > > [email protected]>
> > > > > >> wrote:
> > > > > >> > > >
> > > > > >> > > > > hi folks,
> > > > > >> > > > >
> > > > > >> > > > > We have just recently implemented the new LogicalType 
> > > > > >> > > > > unions
> > > > in
> > > > > >> the
> > > > > >> > > > > Parquet C++ library and we have run into a forward
> > > > compatibility
> > > > > >> > > > > problem with reader versions prior to this implementation.
> > > > > >> > > > >
> > > > > >> > > > > To recap the issue, prior to the introduction of 
> > > > > >> > > > > LogicalType,
> > > > the
> > > > > >> > > > > Parquet format had no explicit notion of time zones or UTC
> > > > > >> > > > > normalization. The new TimestampType provides a flag to
> > > > indicate
> > > > > >> > > > > UTC-normalization
> > > > > >> > > > >
> > > > > >> > > > > struct TimestampType {
> > > > > >> > > > > 1: required bool isAdjustedToUTC
> > > > > >> > > > > 2: required TimeUnit unit
> > > > > >> > > > > }
> > > > > >> > > > >
> > > > > >> > > > > When using this new type, the ConvertedType field must 
> > > > > >> > > > > also be
> > > > > >> set for
> > > > > >> > > > > forward compatibility (so that old readers can still
> > > > understand
> > > > > >> the
> > > > > >> > > > > data), but parquet.thrift says
> > > > > >> > > > >
> > > > > >> > > > > // use ConvertedType TIMESTAMP_MICROS for
> > > > > >> TIMESTAMP(isAdjustedToUTC =
> > > > > >> > > > > true, unit = MICROS)
> > > > > >> > > > > // use ConvertedType TIMESTAMP_MILLIS for
> > > > > >> TIMESTAMP(isAdjustedToUTC =
> > > > > >> > > > > true, unit = MILLIS)
> > > > > >> > > > > 8: TimestampType TIMESTAMP
> > > > > >> > > > >
> > > > > >> > > > > In Apache Arrow, we have 2 varieties of timestamps:
> > > > > >> > > > >
> > > > > >> > > > > * Timestamp without time zone (no UTC normalization 
> > > > > >> > > > > indicated)
> > > > > >> > > > > * Timestamp with time zone (values UTC-normalized)
> > > > > >> > > > >
> > > > > >> > > > > Prior to the introduction of LogicalType, we would set 
> > > > > >> > > > > either
> > > > > >> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC
> > > > > >> > > > > normalization. So when reading the data back, any notion of
> > > > > >> having had
> > > > > >> > > > > a time zone is lost (it could be stored in schema metadata 
> > > > > >> > > > > if
> > > > > >> > > > > desired).
> > > > > >> > > > >
> > > > > >> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_
> > > > when
> > > > > >> > > > > isAdjustedToUTC is true creates a forward compatibility 
> > > > > >> > > > > break
> > > > in
> > > > > >> this
> > > > > >> > > > > regard. This was reported to us shortly after releasing 
> > > > > >> > > > > Apache
> > > > > >> Arrow
> > > > > >> > > > > 0.14.0:
> > > > > >> > > > >
> > > > > >> > > > > https://issues.apache.org/jira/browse/ARROW-5878
> > > > > >> > > > >
> > > > > >> > > > > We are discussing setting the ConvertedType 
> > > > > >> > > > > unconditionally in
> > > > > >> > > > >
> > > > > >> > > > > https://github.com/apache/arrow/pull/4825
> > > > > >> > > > >
> > > > > >> > > > > This might need to be a setting that is toggled when data 
> > > > > >> > > > > is
> > > > > >> coming
> > > > > >> > > > > from Arrow, but I wonder if the text in parquet.thrift is 
> > > > > >> > > > > the
> > > > > >> intended
> > > > > >> > > > > forward compatibility interpretation, and if not should we
> > > > amend.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > > Wes
> > > > > >> > > > >
> > > > > >> > >
> > > > > >>
> > > > > >
> > > >

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Reply via email to