Hi Wes, Both of the semantics are deterministic in one aspect and indeterministic in another. Timestamps of instant semantic will always refer to the same instant, but their user-facing representation (how they get displayed) depends on the user's time zone. Timestamps of local semantics always have the same user-facing representation but the instant they refer to is undefined (or ambigous, depending on your point of view).
My understanding is that Spark uses instant semantics, i.e., timestamps are stored normalized to UTC and are displayed adjusted to the user's local time zone. Br, Zoltan On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <[email protected]> wrote: > Thanks Zoltan. > > This is definitely a tricky issue. > > Spark's application of localtime semantics to timestamp data has been > a source of issues for many people. Personally I don't find that > behavior to be particularly helpful since depending on the session > time zone, you will get different results on data not marked as > UTC-normalized. > > In pandas, as contrast, we apply UTC semantics to > naive/not-explicitly-normalized data so at least the code produces > deterministic results on all environments. That seems strictly better > to me -- if you want a localized interpretation of naive data, you > must opt into this rather than having it automatically selected based > on your locale. The instances of people shooting their toes off due to > time zones are practically non-existent, whereas I'm hearing about > Spark gotchas all the time. > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi <[email protected]> > wrote: > > > > Hi Wes, > > > > The rules for TIMESTAMP forward-compatibility were created based on the > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been used > > in the instant aka. UTC-normalized semantics so far. This assumption was > > supported by two sources: > > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since the > Unix > > epoch, an instant specified in UTC, from which it follows that they have > > instant semantics (because timestamps of local semantics do not > correspond > > to a single instant). > > > > 2. Anecdotal knowledge: We were not aware of any software component that > > used these types differently from the specification. > > > > Based on your e-mail, we were wrong on #2. > > > > From this false premise it followed that TIMESTAMPs with local semantics > > were a new type and did not need to be annotated with the old types to > > maintain compatibility. In fact, annotating them with the old types were > > considered to be harmful, since it would have mislead older readers into > > thinking that they can read TIMESTAMPs with local semantics, when in > > reality they would have misinterpreted them as TIMESTAMPs with instant > > semantics. This would have lead to a difference of several hours, > > corresponding to the time zone offset. > > > > In the light of your e-mail, this misinterpretation of timestamps may > > already be happening, since if Arrow annotates local timestamps with > > TIMESTAMP_MILLIS or TIMESTMAP_MICROS, Spark probably misinterprets them > as > > timestamps with instant semantics, leading to a difference of several > hours. > > > > Based on this, I think it would make sense from Arrow's point of view to > > annotate both semantics with the old types, since that is its historical > > behaviour and keeping it up is needed for maintaining compatibilty. I'm > not > > so sure about the Java library though, since as far as I know, these > types > > were never used in the local sense there (although I may be wrong again). > > Were we to decide that Arrow and parquet-mr should behave differently in > > this aspect though, it may be tricky to convey this distinction in the > > specification. I would be interested in hearing your and other > developers' > > opinions on this. > > > > Thanks, > > > > Zoltan > > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <[email protected]> wrote: > > > > > hi folks, > > > > > > We have just recently implemented the new LogicalType unions in the > > > Parquet C++ library and we have run into a forward compatibility > > > problem with reader versions prior to this implementation. > > > > > > To recap the issue, prior to the introduction of LogicalType, the > > > Parquet format had no explicit notion of time zones or UTC > > > normalization. The new TimestampType provides a flag to indicate > > > UTC-normalization > > > > > > struct TimestampType { > > > 1: required bool isAdjustedToUTC > > > 2: required TimeUnit unit > > > } > > > > > > When using this new type, the ConvertedType field must also be set for > > > forward compatibility (so that old readers can still understand the > > > data), but parquet.thrift says > > > > > > // use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC = > > > true, unit = MICROS) > > > // use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC = > > > true, unit = MILLIS) > > > 8: TimestampType TIMESTAMP > > > > > > In Apache Arrow, we have 2 varieties of timestamps: > > > > > > * Timestamp without time zone (no UTC normalization indicated) > > > * Timestamp with time zone (values UTC-normalized) > > > > > > Prior to the introduction of LogicalType, we would set either > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC > > > normalization. So when reading the data back, any notion of having had > > > a time zone is lost (it could be stored in schema metadata if > > > desired). > > > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_ when > > > isAdjustedToUTC is true creates a forward compatibility break in this > > > regard. This was reported to us shortly after releasing Apache Arrow > > > 0.14.0: > > > > > > https://issues.apache.org/jira/browse/ARROW-5878 > > > > > > We are discussing setting the ConvertedType unconditionally in > > > > > > https://github.com/apache/arrow/pull/4825 > > > > > > This might need to be a setting that is toggled when data is coming > > > from Arrow, but I wonder if the text in parquet.thrift is the intended > > > forward compatibility interpretation, and if not should we amend. > > > > > > Thanks, > > > Wes > > > >
