Thank for the comments.

So in summary I think that we need to set the TIMESTAMP_* converted
types to maintain forward compatibility and stay consistent with what
we were doing in the C++ library prior to the introduction of the
LogicalType metadata.

On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <[email protected]> wrote:
>
> Hi Wes,
>
> Both of the semantics are deterministic in one aspect and indeterministic
> in another. Timestamps of instant semantic will always refer to the same
> instant, but their user-facing representation (how they get displayed)
> depends on the user's time zone. Timestamps of local semantics always have
> the same user-facing representation but the instant they refer to is
> undefined (or ambigous, depending on your point of view).
>
> My understanding is that Spark uses instant semantics, i.e., timestamps are
> stored normalized to UTC and are displayed adjusted to the user's local
> time zone.
>
> Br,
>
> Zoltan
>
> On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <[email protected]> wrote:
>
> > Thanks Zoltan.
> >
> > This is definitely a tricky issue.
> >
> > Spark's application of localtime semantics to timestamp data has been
> > a source of issues for many people. Personally I don't find that
> > behavior to be particularly helpful since depending on the session
> > time zone, you will get different results on data not marked as
> > UTC-normalized.
> >
> > In pandas, as contrast, we apply UTC semantics to
> > naive/not-explicitly-normalized data so at least the code produces
> > deterministic results on all environments. That seems strictly better
> > to me -- if you want a localized interpretation of naive data, you
> > must opt into this rather than having it automatically selected based
> > on your locale. The instances of people shooting their toes off due to
> > time zones are practically non-existent, whereas I'm hearing about
> > Spark gotchas all the time.
> >
> > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi <[email protected]>
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > The rules for TIMESTAMP forward-compatibility were created based on the
> > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been used
> > > in the instant aka. UTC-normalized semantics so far. This assumption was
> > > supported by two sources:
> > >
> > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and
> > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since the
> > Unix
> > > epoch, an instant specified in UTC, from which it follows that they have
> > > instant semantics (because timestamps of local semantics do not
> > correspond
> > > to a single instant).
> > >
> > > 2. Anecdotal knowledge: We were not aware of any software component that
> > > used these types differently from the specification.
> > >
> > > Based on your e-mail, we were wrong on #2.
> > >
> > > From this false premise it followed that TIMESTAMPs with local semantics
> > > were a new type and did not need to be annotated with the old types to
> > > maintain compatibility. In fact, annotating them with the old types were
> > > considered to be harmful, since it would have mislead older readers into
> > > thinking that they can read TIMESTAMPs with local semantics, when in
> > > reality they would have misinterpreted them as TIMESTAMPs with instant
> > > semantics. This would have lead to a difference of several hours,
> > > corresponding to the time zone offset.
> > >
> > > In the light of your e-mail, this misinterpretation of timestamps may
> > > already be happening, since if Arrow annotates local timestamps with
> > > TIMESTAMP_MILLIS or TIMESTMAP_MICROS, Spark probably misinterprets them
> > as
> > > timestamps with instant semantics, leading to a difference of several
> > hours.
> > >
> > > Based on this, I think it would make sense from Arrow's point of view to
> > > annotate both semantics with the old types, since that is its historical
> > > behaviour and keeping it up is needed for maintaining compatibilty. I'm
> > not
> > > so sure about the Java library though, since as far as I know, these
> > types
> > > were never used in the local sense there (although I may be wrong again).
> > > Were we to decide that Arrow and parquet-mr should behave differently in
> > > this aspect though, it may be tricky to convey this distinction in the
> > > specification. I would be interested in hearing your and other
> > developers'
> > > opinions on this.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <[email protected]> wrote:
> > >
> > > > hi folks,
> > > >
> > > > We have just recently implemented the new LogicalType unions in the
> > > > Parquet C++ library and we have run into a forward compatibility
> > > > problem with reader versions prior to this implementation.
> > > >
> > > > To recap the issue, prior to the introduction of LogicalType, the
> > > > Parquet format had no explicit notion of time zones or UTC
> > > > normalization. The new TimestampType provides a flag to indicate
> > > > UTC-normalization
> > > >
> > > > struct TimestampType {
> > > > 1: required bool isAdjustedToUTC
> > > > 2: required TimeUnit unit
> > > > }
> > > >
> > > > When using this new type, the ConvertedType field must also be set for
> > > > forward compatibility (so that old readers can still understand the
> > > > data), but parquet.thrift says
> > > >
> > > > // use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC =
> > > > true, unit = MICROS)
> > > > // use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC =
> > > > true, unit = MILLIS)
> > > > 8: TimestampType TIMESTAMP
> > > >
> > > > In Apache Arrow, we have 2 varieties of timestamps:
> > > >
> > > > * Timestamp without time zone (no UTC normalization indicated)
> > > > * Timestamp with time zone (values UTC-normalized)
> > > >
> > > > Prior to the introduction of LogicalType, we would set either
> > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC
> > > > normalization. So when reading the data back, any notion of having had
> > > > a time zone is lost (it could be stored in schema metadata if
> > > > desired).
> > > >
> > > > I believe that setting the TIMESTAMP_* ConvertedType _only_ when
> > > > isAdjustedToUTC is true creates a forward compatibility break in this
> > > > regard. This was reported to us shortly after releasing Apache Arrow
> > > > 0.14.0:
> > > >
> > > > https://issues.apache.org/jira/browse/ARROW-5878
> > > >
> > > > We are discussing setting the ConvertedType unconditionally in
> > > >
> > > > https://github.com/apache/arrow/pull/4825
> > > >
> > > > This might need to be a setting that is toggled when data is coming
> > > > from Arrow, but I wonder if the text in parquet.thrift is the intended
> > > > forward compatibility interpretation, and if not should we amend.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> >

Reply via email to