Hi Wes, Do you mean that the new logical types have already been released in 0.14.0 and a 0.14.1 is needed ASAP to fix this regression?
Thanks, Zoltan On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <wesmck...@gmail.com> wrote: > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these > issues and others we need to make a new release within the next 7-10 > days. We can point you to nightly Python builds to make testing for > you easier so you don't have to build the project yourself. > > - Wes > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <z...@cloudera.com.invalid> > wrote: > > > > Hi, > > > > Oh, and one more thing: Before releasing the next Arrow version > > incorporating the new logical types, we should definitely test that their > > behaviour matches that of parquet-mr. When is the next release planned to > > come out? > > > > Br, > > > > Zoltan > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <z...@cloudera.com> wrote: > > > > > Hi Wes, > > > > > > Yes, I agree that we should do that, but then we have a problem of > what to > > > do in the other direction, i.e. when we use the new logical types API > to > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC > > > normalized flag? Tim has started a discussion about that, suggesting > to use > > > three states that I just answered. > > > > > > Br, > > > > > > Zoltan > > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > >> Thank for the comments. > > >> > > >> So in summary I think that we need to set the TIMESTAMP_* converted > > >> types to maintain forward compatibility and stay consistent with what > > >> we were doing in the C++ library prior to the introduction of the > > >> LogicalType metadata. > > >> > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <z...@cloudera.com.invalid > > > > >> wrote: > > >> > > > >> > Hi Wes, > > >> > > > >> > Both of the semantics are deterministic in one aspect and > > >> indeterministic > > >> > in another. Timestamps of instant semantic will always refer to the > same > > >> > instant, but their user-facing representation (how they get > displayed) > > >> > depends on the user's time zone. Timestamps of local semantics > always > > >> have > > >> > the same user-facing representation but the instant they refer to is > > >> > undefined (or ambigous, depending on your point of view). > > >> > > > >> > My understanding is that Spark uses instant semantics, i.e., > timestamps > > >> are > > >> > stored normalized to UTC and are displayed adjusted to the user's > local > > >> > time zone. > > >> > > > >> > Br, > > >> > > > >> > Zoltan > > >> > > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <wesmck...@gmail.com> > > >> wrote: > > >> > > > >> > > Thanks Zoltan. > > >> > > > > >> > > This is definitely a tricky issue. > > >> > > > > >> > > Spark's application of localtime semantics to timestamp data has > been > > >> > > a source of issues for many people. Personally I don't find that > > >> > > behavior to be particularly helpful since depending on the session > > >> > > time zone, you will get different results on data not marked as > > >> > > UTC-normalized. > > >> > > > > >> > > In pandas, as contrast, we apply UTC semantics to > > >> > > naive/not-explicitly-normalized data so at least the code produces > > >> > > deterministic results on all environments. That seems strictly > better > > >> > > to me -- if you want a localized interpretation of naive data, you > > >> > > must opt into this rather than having it automatically selected > based > > >> > > on your locale. The instances of people shooting their toes off > due to > > >> > > time zones are practically non-existent, whereas I'm hearing about > > >> > > Spark gotchas all the time. > > >> > > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > <z...@cloudera.com.invalid > > >> > > > >> > > wrote: > > >> > > > > > >> > > > Hi Wes, > > >> > > > > > >> > > > The rules for TIMESTAMP forward-compatibility were created > based on > > >> the > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only > > >> been used > > >> > > > in the instant aka. UTC-normalized semantics so far. This > > >> assumption was > > >> > > > supported by two sources: > > >> > > > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS > and > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed > since > > >> the > > >> > > Unix > > >> > > > epoch, an instant specified in UTC, from which it follows that > they > > >> have > > >> > > > instant semantics (because timestamps of local semantics do not > > >> > > correspond > > >> > > > to a single instant). > > >> > > > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software > component > > >> that > > >> > > > used these types differently from the specification. > > >> > > > > > >> > > > Based on your e-mail, we were wrong on #2. > > >> > > > > > >> > > > From this false premise it followed that TIMESTAMPs with local > > >> semantics > > >> > > > were a new type and did not need to be annotated with the old > types > > >> to > > >> > > > maintain compatibility. In fact, annotating them with the old > types > > >> were > > >> > > > considered to be harmful, since it would have mislead older > readers > > >> into > > >> > > > thinking that they can read TIMESTAMPs with local semantics, > when in > > >> > > > reality they would have misinterpreted them as TIMESTAMPs with > > >> instant > > >> > > > semantics. This would have lead to a difference of several > hours, > > >> > > > corresponding to the time zone offset. > > >> > > > > > >> > > > In the light of your e-mail, this misinterpretation of > timestamps > > >> may > > >> > > > already be happening, since if Arrow annotates local timestamps > with > > >> > > > TIMESTAMP_MILLIS or TIMESTMAP_MICROS, Spark probably > misinterprets > > >> them > > >> > > as > > >> > > > timestamps with instant semantics, leading to a difference of > > >> several > > >> > > hours. > > >> > > > > > >> > > > Based on this, I think it would make sense from Arrow's point of > > >> view to > > >> > > > annotate both semantics with the old types, since that is its > > >> historical > > >> > > > behaviour and keeping it up is needed for maintaining > compatibilty. > > >> I'm > > >> > > not > > >> > > > so sure about the Java library though, since as far as I know, > these > > >> > > types > > >> > > > were never used in the local sense there (although I may be > wrong > > >> again). > > >> > > > Were we to decide that Arrow and parquet-mr should behave > > >> differently in > > >> > > > this aspect though, it may be tricky to convey this distinction > in > > >> the > > >> > > > specification. I would be interested in hearing your and other > > >> > > developers' > > >> > > > opinions on this. > > >> > > > > > >> > > > Thanks, > > >> > > > > > >> > > > Zoltan > > >> > > > > > >> > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney < > wesmck...@gmail.com> > > >> wrote: > > >> > > > > > >> > > > > hi folks, > > >> > > > > > > >> > > > > We have just recently implemented the new LogicalType unions > in > > >> the > > >> > > > > Parquet C++ library and we have run into a forward > compatibility > > >> > > > > problem with reader versions prior to this implementation. > > >> > > > > > > >> > > > > To recap the issue, prior to the introduction of LogicalType, > the > > >> > > > > Parquet format had no explicit notion of time zones or UTC > > >> > > > > normalization. The new TimestampType provides a flag to > indicate > > >> > > > > UTC-normalization > > >> > > > > > > >> > > > > struct TimestampType { > > >> > > > > 1: required bool isAdjustedToUTC > > >> > > > > 2: required TimeUnit unit > > >> > > > > } > > >> > > > > > > >> > > > > When using this new type, the ConvertedType field must also be > > >> set for > > >> > > > > forward compatibility (so that old readers can still > understand > > >> the > > >> > > > > data), but parquet.thrift says > > >> > > > > > > >> > > > > // use ConvertedType TIMESTAMP_MICROS for > > >> TIMESTAMP(isAdjustedToUTC = > > >> > > > > true, unit = MICROS) > > >> > > > > // use ConvertedType TIMESTAMP_MILLIS for > > >> TIMESTAMP(isAdjustedToUTC = > > >> > > > > true, unit = MILLIS) > > >> > > > > 8: TimestampType TIMESTAMP > > >> > > > > > > >> > > > > In Apache Arrow, we have 2 varieties of timestamps: > > >> > > > > > > >> > > > > * Timestamp without time zone (no UTC normalization indicated) > > >> > > > > * Timestamp with time zone (values UTC-normalized) > > >> > > > > > > >> > > > > Prior to the introduction of LogicalType, we would set either > > >> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC > > >> > > > > normalization. So when reading the data back, any notion of > > >> having had > > >> > > > > a time zone is lost (it could be stored in schema metadata if > > >> > > > > desired). > > >> > > > > > > >> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_ > when > > >> > > > > isAdjustedToUTC is true creates a forward compatibility break > in > > >> this > > >> > > > > regard. This was reported to us shortly after releasing Apache > > >> Arrow > > >> > > > > 0.14.0: > > >> > > > > > > >> > > > > https://issues.apache.org/jira/browse/ARROW-5878 > > >> > > > > > > >> > > > > We are discussing setting the ConvertedType unconditionally in > > >> > > > > > > >> > > > > https://github.com/apache/arrow/pull/4825 > > >> > > > > > > >> > > > > This might need to be a setting that is toggled when data is > > >> coming > > >> > > > > from Arrow, but I wonder if the text in parquet.thrift is the > > >> intended > > >> > > > > forward compatibility interpretation, and if not should we > amend. > > >> > > > > > > >> > > > > Thanks, > > >> > > > > Wes > > >> > > > > > > >> > > > > >> > > > >