Hi Wes,

Sounds good to me.

Br,

Zoltan

On Thu, Jul 11, 2019 at 4:14 PM Wes McKinney <[email protected]> wrote:
>
> On Thu, Jul 11, 2019 at 8:17 AM Zoltan Ivanfi <[email protected]> wrote:
> >
> > Hi Wes,
> >
> > I did a little bit of testing using pyarrow 0.14.0. I know that this
> > is not the latest state of the code, but my understanding is that the
> > only planned change is that 0.14.1 will add the legacy types even for
> > local semantics. Here is what I did:
> >
> > In [1]: import pandas as pd
> >    ...: from datetime import datetime
> >    ...: from pandas import Timestamp
> >    ...: from pytz import timezone
> >    ...: ts_dt = datetime(1970, 1, 1)
> >    ...: ts_pd = Timestamp(year=1970, month=1, day=1)
> >    ...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1,
> >    ...:                         tz=timezone("Europe/Paris"))
> >    ...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1, hour=2,
> >    ...:                            tz=timezone("Europe/Helsinki"))
> >    ...: df = pd.DataFrame({
> >    ...:     'datetime': [ts_dt, None],
> >    ...:     'pd_no_tz': [ts_pd, None],
> >    ...:     'pd_paris': [ts_pd_paris, None],
> >    ...:     'pd_helsinki': [ts_pd_helsinki, None],
> >    ...:     'pd_mixed': [ts_pd_paris, ts_pd_helsinki]
> >    ...: })
> >    ...: df.to_parquet('test.parquet')
> >    ...: df
> > Out[1]:
> >     datetime   pd_no_tz                  pd_paris               pd_helsinki                  pd_mixed
> > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00+01:00
> > 1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00+02:00
> >
> > I picked these values because I expected all of these timestamps to be
> > saved as the numeric value 0.
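As a quick aside, the expectation that all of these timestamps denote the instant 0 can be verified directly in pandas, since `Timestamp.value` is nanoseconds since the Unix epoch (UTC-based for tz-aware values). This sketch uses the same constructor arguments as above, with string time zone names instead of pytz objects:

```python
import pandas as pd

# Naive timestamp: wall clock 1970-01-01 00:00, epoch value 0 by definition.
ts_pd = pd.Timestamp(year=1970, month=1, day=1)

# 01:00 in Paris (+01:00 in January 1970) and 02:00 in Helsinki (+02:00)
# are both the instant 1970-01-01 00:00:00 UTC.
ts_paris = pd.Timestamp(year=1970, month=1, day=1, hour=1,
                        tz="Europe/Paris")
ts_helsinki = pd.Timestamp(year=1970, month=1, day=1, hour=2,
                           tz="Europe/Helsinki")

# .value is nanoseconds since the epoch (UTC for tz-aware timestamps)
print(ts_pd.value, ts_paris.value, ts_helsinki.value)  # 0 0 0
```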
> > Checking the metadata using parquet-tools:
> >
> > > parquet-tools meta test.parquet
> > datetime:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> > pd_no_tz:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> > pd_paris:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> > pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> > pd_mixed:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> >
> > This matched my expectations up until pd_mixed. I was surprised to see
> > that timestamps with mixed time zones were stored using local
> > semantics instead of being normalized to UTC, but if it's an
> > intentional choice, I'm fine with it as long as the numbers are
> > correct:
> >
> > > parquet-tools head test.parquet
> > datetime = 0
> > pd_no_tz = 0
> > pd_paris = 0
> > pd_helsinki = 0
> > pd_mixed = 3600000
> > pd_mixed = 7200000
> >
> > The numbers for the mixed column are not 0, but that is just the
> > result of not using UTC-normalized semantics there. The values can be
> > correctly interpreted by parquet-tools:
> >
> > > parquet-tools dump test.parquet
> > INT64 datetime
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_no_tz
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_paris
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_helsinki
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
> > value 2: R:0 D:0 V:<null>
> >
> > INT64 pd_mixed
> > --------------------------------------------------------------------------------
> > *** row group 1 of 1, values 1 to 2 ***
> > value 1: R:0 D:1 V:1970-01-01T01:00:00.000
> > value 2: R:0 D:1 V:1970-01-01T02:00:00.000
> >
> > And naturally by pandas as well:
> >
> > In [2]: df2 = pd.read_parquet('test.parquet')
> >    ...: df2
> > Out[2]:
> >     datetime   pd_no_tz                  pd_paris               pd_helsinki            pd_mixed
> > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00
> > 1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00
> >
> > Finally, let's try an older version of parquet-tools as well:
> >
> > > parquet-tools meta test.parquet
> > datetime:    OPTIONAL INT64 R:0 D:1
> > pd_no_tz:    OPTIONAL INT64 R:0 D:1
> > pd_paris:    OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
> > pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
> > pd_mixed:    OPTIONAL INT64 R:0 D:1
>
> Just to confirm that in pyarrow 0.13.0 (which tracks
> parquet-cpp-in-development 1.6.0, prior to the introduction of
> LogicalType), TIMESTAMP_* is set for all 5 columns. This is the
> forward compatibility that we are planning to restore.
>
> > This confirms that the legacy types are only written for
> > UTC-normalized timestamps in 0.14.0.
> >
> > In summary, when saving two timestamps from different time zones that
> > refer to the same instant, I would have expected pyarrow to normalize
> > to UTC and thereby sacrifice the local representations instead of
> > saving using local semantics and thereby sacrificing the instants. I
> > don't know whether that is the intended behaviour, but in any case,
> > based on this short manual testing, the new timestamp types written by
> > pyarrow are interoperable with the Java library.
> >
> > Br,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 4:30 PM Wes McKinney <[email protected]> wrote:
> > >
> > > Correct
> > >
> > > On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi <[email protected]> wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > Do you mean that the new logical types have already been released in
> > > > 0.14.0 and a 0.14.1 is needed ASAP to fix this regression?
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <[email protected]> wrote:
> > > > >
> > > > > hi Zoltan -- given the raging fire that is 0.14.0 as a result of
> > > > > these issues and others, we need to make a new release within the
> > > > > next 7-10 days. We can point you to nightly Python builds to make
> > > > > testing easier for you so you don't have to build the project
> > > > > yourself.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <[email protected]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Oh, and one more thing: before releasing the next Arrow version
> > > > > > incorporating the new logical types, we should definitely test
> > > > > > that their behaviour matches that of parquet-mr. When is the next
> > > > > > release planned to come out?
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Wes,
> > > > > > >
> > > > > > > Yes, I agree that we should do that, but then we have the
> > > > > > > problem of what to do in the other direction, i.e. when we use
> > > > > > > the new logical types API to read a TIMESTAMP_MILLIS or
> > > > > > > TIMESTAMP_MICROS, how should we set the UTC-normalized flag?
> > > > > > > Tim has started a discussion about that, suggesting to use
> > > > > > > three states, which I have just answered.
> > > > > > >
> > > > > > > Br,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Thanks for the comments.
> > > > > > > >
> > > > > > > > So in summary I think that we need to set the TIMESTAMP_*
> > > > > > > > converted types to maintain forward compatibility and stay
> > > > > > > > consistent with what we were doing in the C++ library prior
> > > > > > > > to the introduction of the LogicalType metadata.
> > > > > > > >
> > > > > > > > On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi Wes,
> > > > > > > > >
> > > > > > > > > Both of the semantics are deterministic in one aspect and
> > > > > > > > > indeterministic in another. Timestamps of instant semantics
> > > > > > > > > will always refer to the same instant, but their user-facing
> > > > > > > > > representation (how they get displayed) depends on the
> > > > > > > > > user's time zone. Timestamps of local semantics always have
> > > > > > > > > the same user-facing representation, but the instant they
> > > > > > > > > refer to is undefined (or ambiguous, depending on your point
> > > > > > > > > of view).
> > > > > > > > >
> > > > > > > > > My understanding is that Spark uses instant semantics, i.e.,
> > > > > > > > > timestamps are stored normalized to UTC and are displayed
> > > > > > > > > adjusted to the user's local time zone.
> > > > > > > > >
> > > > > > > > > Br,
> > > > > > > > >
> > > > > > > > > Zoltan
> > > > > > > > >
> > > > > > > > > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks Zoltan.
> > > > > > > > > >
> > > > > > > > > > This is definitely a tricky issue.
> > > > > > > > > >
> > > > > > > > > > Spark's application of localtime semantics to timestamp
> > > > > > > > > > data has been a source of issues for many people.
> > > > > > > > > > Personally I don't find that behavior to be particularly
> > > > > > > > > > helpful, since depending on the session time zone, you
> > > > > > > > > > will get different results on data not marked as
> > > > > > > > > > UTC-normalized.
> > > > > > > > > >
> > > > > > > > > > In pandas, in contrast, we apply UTC semantics to
> > > > > > > > > > naive/not-explicitly-normalized data, so at least the code
> > > > > > > > > > produces deterministic results in all environments. That
> > > > > > > > > > seems strictly better to me -- if you want a localized
> > > > > > > > > > interpretation of naive data, you must opt into this
> > > > > > > > > > rather than having it automatically selected based on your
> > > > > > > > > > locale. The instances of people shooting their toes off
> > > > > > > > > > due to time zones are practically non-existent, whereas
> > > > > > > > > > I'm hearing about Spark gotchas all the time.
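The explicit opt-in described above can be illustrated with a short pandas sketch (illustrative values only): naive data carries no instant and looks the same in every environment, and attaching a zone is a deliberate step rather than a locale-dependent default.

```python
import pandas as pd

# Naive data: no instant attached, same wall-clock value everywhere.
naive = pd.Series(pd.to_datetime(["1970-01-01 01:00:00"]))

# A localized interpretation is an explicit opt-in:
paris = naive.dt.tz_localize("Europe/Paris")  # declare the wall clock to be Paris time
as_utc = paris.dt.tz_convert("UTC")           # now an unambiguous instant

print(as_utc.iloc[0])  # 1970-01-01 00:00:00+00:00
```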
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Wes,
> > > > > > > > > > >
> > > > > > > > > > > The rules for TIMESTAMP forward compatibility were
> > > > > > > > > > > created based on the assumption that TIMESTAMP_MILLIS
> > > > > > > > > > > and TIMESTAMP_MICROS have only been used in the instant
> > > > > > > > > > > (aka UTC-normalized) semantics so far. This assumption
> > > > > > > > > > > was supported by two sources:
> > > > > > > > > > >
> > > > > > > > > > > 1. The specification: parquet-format defined
> > > > > > > > > > > TIMESTAMP_MILLIS and TIMESTAMP_MICROS as the number of
> > > > > > > > > > > milli/microseconds elapsed since the Unix epoch, an
> > > > > > > > > > > instant specified in UTC, from which it follows that
> > > > > > > > > > > they have instant semantics (because timestamps of local
> > > > > > > > > > > semantics do not correspond to a single instant).
> > > > > > > > > > >
> > > > > > > > > > > 2. Anecdotal knowledge: we were not aware of any
> > > > > > > > > > > software component that used these types differently
> > > > > > > > > > > from the specification.
> > > > > > > > > > >
> > > > > > > > > > > Based on your e-mail, we were wrong on #2.
> > > > > > > > > > >
> > > > > > > > > > > From this false premise it followed that TIMESTAMPs with
> > > > > > > > > > > local semantics were a new type and did not need to be
> > > > > > > > > > > annotated with the old types to maintain compatibility.
> > > > > > > > > > > In fact, annotating them with the old types was
> > > > > > > > > > > considered to be harmful, since it would have misled
> > > > > > > > > > > older readers into thinking that they can read
> > > > > > > > > > > TIMESTAMPs with local semantics, when in reality they
> > > > > > > > > > > would have misinterpreted them as TIMESTAMPs with
> > > > > > > > > > > instant semantics. This would have led to a difference
> > > > > > > > > > > of several hours, corresponding to the time zone offset.
> > > > > > > > > > >
> > > > > > > > > > > In the light of your e-mail, this misinterpretation of
> > > > > > > > > > > timestamps may already be happening, since if Arrow
> > > > > > > > > > > annotates local timestamps with TIMESTAMP_MILLIS or
> > > > > > > > > > > TIMESTAMP_MICROS, Spark probably misinterprets them as
> > > > > > > > > > > timestamps with instant semantics, leading to a
> > > > > > > > > > > difference of several hours.
> > > > > > > > > > >
> > > > > > > > > > > Based on this, I think it would make sense from Arrow's
> > > > > > > > > > > point of view to annotate both semantics with the old
> > > > > > > > > > > types, since that is its historical behaviour and
> > > > > > > > > > > keeping it up is needed for maintaining compatibility.
> > > > > > > > > > > I'm not so sure about the Java library though, since as
> > > > > > > > > > > far as I know, these types were never used in the local
> > > > > > > > > > > sense there (although I may be wrong again).
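The several-hour shift described above is easy to reproduce numerically. This is an illustrative sketch with made-up values, not actual Spark code: a writer stores the local wall clock 1970-01-01 01:00 as 3,600,000 ms with no UTC flag, and a reader that treats TIMESTAMP_MILLIS as UTC-normalized renders it in a Europe/Paris session zone.

```python
import pandas as pd

millis = 3_600_000  # writer meant local wall clock 1970-01-01 01:00:00

# An instant-semantics reader treats the count as UTC-based...
as_instant = pd.Timestamp(millis * 1_000_000, tz="UTC")  # integer input is nanoseconds
# ...and displays it in the session time zone:
shown = as_instant.tz_convert("Europe/Paris")

print(shown)  # 1970-01-01 02:00:00+01:00 -- off by the zone offset
```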
> > > > > > > > > > >
> > > > > > > > > > > Were we to decide that Arrow and parquet-mr should
> > > > > > > > > > > behave differently in this aspect though, it may be
> > > > > > > > > > > tricky to convey this distinction in the specification.
> > > > > > > > > > > I would be interested in hearing your and other
> > > > > > > > > > > developers' opinions on this.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Zoltan
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > hi folks,
> > > > > > > > > > > >
> > > > > > > > > > > > We have just recently implemented the new LogicalType
> > > > > > > > > > > > unions in the Parquet C++ library and we have run into
> > > > > > > > > > > > a forward compatibility problem with reader versions
> > > > > > > > > > > > prior to this implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > To recap the issue, prior to the introduction of
> > > > > > > > > > > > LogicalType, the Parquet format had no explicit notion
> > > > > > > > > > > > of time zones or UTC normalization. The new
> > > > > > > > > > > > TimestampType provides a flag to indicate
> > > > > > > > > > > > UTC normalization:
> > > > > > > > > > > >
> > > > > > > > > > > > struct TimestampType {
> > > > > > > > > > > >   1: required bool isAdjustedToUTC
> > > > > > > > > > > >   2: required TimeUnit unit
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > When using this new type, the ConvertedType field must
> > > > > > > > > > > > also be set for forward compatibility (so that old
> > > > > > > > > > > > readers can still understand the data), but
> > > > > > > > > > > > parquet.thrift says
> > > > > > > > > > > >
> > > > > > > > > > > > // use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC = true, unit = MICROS)
> > > > > > > > > > > > // use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC = true, unit = MILLIS)
> > > > > > > > > > > > 8: TimestampType TIMESTAMP
> > > > > > > > > > > >
> > > > > > > > > > > > In Apache Arrow, we have 2 varieties of timestamps:
> > > > > > > > > > > >
> > > > > > > > > > > > * Timestamp without time zone (no UTC normalization indicated)
> > > > > > > > > > > > * Timestamp with time zone (values UTC-normalized)
> > > > > > > > > > > >
> > > > > > > > > > > > Prior to the introduction of LogicalType, we would set
> > > > > > > > > > > > either TIMESTAMP_MILLIS or TIMESTAMP_MICROS
> > > > > > > > > > > > unconditionally, regardless of UTC normalization. So
> > > > > > > > > > > > when reading the data back, any notion of having had a
> > > > > > > > > > > > time zone is lost (it could be stored in schema
> > > > > > > > > > > > metadata if desired).
> > > > > > > > > > > >
> > > > > > > > > > > > I believe that setting the TIMESTAMP_* ConvertedType
> > > > > > > > > > > > _only_ when isAdjustedToUTC is true creates a forward
> > > > > > > > > > > > compatibility break in this regard. This was reported
> > > > > > > > > > > > to us shortly after releasing Apache Arrow 0.14.0:
> > > > > > > > > > > >
> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-5878
> > > > > > > > > > > >
> > > > > > > > > > > > We are discussing setting the ConvertedType
> > > > > > > > > > > > unconditionally in
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/arrow/pull/4825
> > > > > > > > > > > >
> > > > > > > > > > > > This might need to be a setting that is toggled when
> > > > > > > > > > > > data is coming from Arrow, but I wonder if the text in
> > > > > > > > > > > > parquet.thrift is the intended forward compatibility
> > > > > > > > > > > > interpretation, and if not, whether we should amend
> > > > > > > > > > > > it.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Wes
