Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Zoltan, Thank you for your engagement on this issue, it's much appreciated. The upcoming release of the parquet-cpp library takes essentially the approach you outline. When a Timestamp logical type is created from a legacy converted TIMESTAMP_* value (your third case), the Timestamp's UTC-normalized property is set to "true" (per the backward compatibility requirements in the spec), but an additional property indicating that the LogicalType was derived from a ConvertedType is also set to "true" and exposed via an accessor. So a library user can choose to accept the default UTC normalization inference or override it by checking the new flag. Similarly, for forward compatibility, the default behavior is to produce a TIMESTAMP_* converted type only if the Timestamp logical type is UTC normalized (again conforming to the spec), but the Timestamp LogicalType API allows the library user to force a TIMESTAMP converted type even for non-UTC normalized Timestamps. Given these API options, I tend to agree with you that the need for file format changes is substantially reduced or eliminated. Thanks, Tim On Thu, Jul 11, 2019 at 6:41 AM Zoltan Ivanfi wrote: > Hi Tim, > > I was just told that arrow 0.14.0 is already out and uses the new > logical types. This means that your original suggestion of changing > the boolean to an enum is not really a viable option. Regarding your > second suggestion, an additional field may be added to parquet-format, > but I'm still concerned that it would be confusing and that it would > provide little value in practice. However, thinking more about the > problem I realized that we do not have to add anything to > parquet-format to record the uncertainty of timestamp semantics, since > we already have a way of doing so by using the legacy timestamp > types. It is enough to expose the possibility of having undefined > semantics in the API.
This can be deduced from the logical types > stored in the file in the following way: > > Logical type in file => Logical type in API > --- > new TS type with semantics A => TS with semantics A > new TS type with semantics B => TS with semantics B > legacy TS type => TS with unknown semantics > > For example, TIMESTAMP_MICROS would map to unknown semantics in the > API while TIMESTAMP(isAdjustedToUTC=true, unit=MICROS) would map to > instant semantics. Please let me know your thoughts on this > suggestion. > > Thanks, > > Zoltan > > On Wed, Jul 10, 2019 at 6:43 PM TP Boudreau wrote: > > > > Sorry for the quick self-reply, but after brief reflection I think two > > changes to my alternative proposal are required: > > > > 1. The proposed new field should be a parameter to the TimestampType, not > > FileMetaData -- file level adds unnecessary complication / opportunities > > for mischief. > > 2. Although reported vs. inferred is the logical distinction, practically > > this change is about whether or not the TimestampType was built from a > > TIMESTAMP converted type, so the name should reflect that. > > > > After these amendments, this option boils down to: add a new boolean > > parameter to the TimestampType named (something like) fromConvertedType. > > > > --TPB > > > > On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau > wrote: > > > > > Hi Zoltan, > > > > > > Thank you for the helpful clarification of the community's > understanding > > > of the TIMESTAMP annotation. > > > > > > The core of the problem (IMHO) is that there is no way to distinguish in > the > > > new LogicalType TimestampType between the case where UTC-normalization > has > > > been directly reported (via a user supplied TimestampType boolean > > > parameter) or merely inferred (from a naked TIMESTAMP converted > type). 
So > > > perhaps another alternative might be to retain the isAdjustedToUTC > boolean > > > as is and add another field that indicates whether the adjusted flag > was > > > REPORTED or INFERRED (could be boolean or short variant or some type). > > > This would allow interested readers to differentiate between old and > new > > > timestamps, while allowing other readers to enjoy the default you > believe > > > is warranted. It seems most straightforward that it would be an > additional > > > parameter on the TimestampType, but I suppose it could reside in the > > > FileMetaData struct (on the assumption that the schema elements, having > > > been written by the same writer, all uniformly use converted type or > > > LogicalType). > > > > > > --Tim > > > > > > > > > On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi > > > > wrote: > > > > > >> Hi Tim, > > >> > > >> In my opinion the specification of the older timestamp types only > allowed > > >> UTC-normalized storage, since these types were defined as the number > of > > >> milli/microseconds elapsed since the Unix epoch. This clearly defines > the > > >> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. >
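For readers following along, the parquet-cpp approach Tim describes can be sketched in a few lines. This is a Python model of the described logic only; the class and field names (`from_converted_type`, `force_set_converted_type`) are illustrative stand-ins, not the actual parquet-cpp API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimestampType:
    is_adjusted_to_utc: bool
    unit: str  # "MILLIS" or "MICROS"
    from_converted_type: bool = False       # set when inferred from a legacy annotation
    force_set_converted_type: bool = False  # writer-side override for local semantics

def from_legacy(converted: str) -> TimestampType:
    """Backward compatibility: a bare TIMESTAMP_* converted type is read as
    UTC-normalized (per the spec), but the inference is recorded so callers
    can second-guess it via the extra flag."""
    unit = converted.replace("TIMESTAMP_", "")
    return TimestampType(is_adjusted_to_utc=True, unit=unit, from_converted_type=True)

def to_legacy(ts: TimestampType) -> Optional[str]:
    """Forward compatibility: by default only UTC-normalized timestamps get a
    legacy annotation; the override forces one even for local semantics."""
    if ts.is_adjusted_to_utc or ts.force_set_converted_type:
        return f"TIMESTAMP_{ts.unit}"
    return None
```

The key design point is that the default stays spec-conformant in both directions, while each flag gives the library user an explicit escape hatch.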
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Wes, Sounds good to me. Br, Zoltan On Thu, Jul 11, 2019 at 4:14 PM Wes McKinney wrote: > > On Thu, Jul 11, 2019 at 8:17 AM Zoltan Ivanfi > wrote: > > > > Hi Wes, > > > > I did a little bit of testing using pyarrow 0.14.0. I know that this > > is not the latest state of the code, but my understanding is that the > > only planned change is that 0.14.1 will add the legacy types even for > > local semantics. Here is what I did: > > > > In [1]: import pandas as pd > >...: from datetime import datetime > >...: from pandas import Timestamp > >...: from pytz import timezone > >...: ts_dt = datetime(1970, 1, 1) > >...: ts_pd = Timestamp(year=1970, month=1, day=1) > >...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1, > > tz=timezone("Europe/Paris")) > >...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1, > > hour=2, tz=timezone("Europe/Helsinki")) > >...: df = pd.DataFrame({ > >...: 'datetime': [ts_dt, None], > >...: 'pd_no_tz': [ts_pd, None], > >...: 'pd_paris': [ts_pd_paris, None], > >...: 'pd_helsinki': [ts_pd_helsinki, None], > >...: 'pd_mixed': [ts_pd_paris, ts_pd_helsinki] > >...: }) > >...: df.to_parquet('test.parquet') > >...: df > > Out[1]: > > datetime pd_no_tz pd_paris > > pd_helsinki pd_mixed > > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 > > 02:00:00+02:00 1970-01-01 01:00:00+01:00 > > 1NaTNaT NaT > > NaT 1970-01-01 02:00:00+02:00 > > > > I picked these values because I expected all of these timestamps to be > > saved as the numeric value 0. Checking the metadata using > > parquet-tools: > > > > > parquet-tools meta test.parquet > > datetime:OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1 > > pd_no_tz:OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1 > > pd_paris:OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1 > > pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1 > > pd_mixed:OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1 > > > > This matched my expectations up until pd_mixed. 
I was surprised to see > > that timestamps with mixed time zones were stored using local > > semantics instead of being normalized to UTC, but if it's an > > intentional choice, I'm fine with it as long as the numbers are > > correct: > > > > > parquet-tools head test.parquet > > datetime = 0 > > pd_no_tz = 0 > > pd_paris = 0 > > pd_helsinki = 0 > > pd_mixed = 360 > > pd_mixed = 720 > > > > The numbers for the mixed column are not 0, but that is just the > > result of not using UTC-normalized semantics there. The values can be > > correctly interpreted by parquet-tools: > > > > > parquet-tools dump test.parquet > > INT64 datetime > > > > > > *** row group 1 of 1, values 1 to 2 *** > > value 1: R:0 D:1 V:1970-01-01T00:00:00.000 > > value 2: R:0 D:0 V: > > > > INT64 pd_no_tz > > > > > > *** row group 1 of 1, values 1 to 2 *** > > value 1: R:0 D:1 V:1970-01-01T00:00:00.000 > > value 2: R:0 D:0 V: > > > > INT64 pd_paris > > > > > > *** row group 1 of 1, values 1 to 2 *** > > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+ > > value 2: R:0 D:0 V: > > > > INT64 pd_helsinki > > > > > > *** row group 1 of 1, values 1 to 2 *** > > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+ > > value 2: R:0 D:0 V: > > > > INT64 pd_mixed > > > > > > *** row group 1 of 1, values 1 to 2 *** > > value 1: R:0 D:1 V:1970-01-01T01:00:00.000 > > value 2: R:0 D:1 V:1970-01-01T02:00:00.000 > > > > And naturally by pandas as well: > > > > In [2]: df2 = pd.read_parquet('test.parquet') > >...: df2 > > Out[2]: > > datetime pd_no_tz pd_paris > > pd_helsinki pd_mixed > > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 > > 02:00:00+02:00 1970-01-01 01:00:00 > > 1 NaT NaT NaT > > NaT 1970-01-01 02:00:00 > > > > Finally, let's try an older version of parquet-tools as well: > > > > > parquet-tools meta test.parquet > > datetime:OPTIONAL INT64 R:0 D:1 > > pd_no_tz:OPTIONAL INT64 R:0 D:1 > > pd_paris:OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1 > > pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1 > > 
pd_mixed:OPTIONAL INT64 R:0
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
On Thu, Jul 11, 2019 at 8:17 AM Zoltan Ivanfi wrote: > > Hi Wes, > > I did a little bit of testing using pyarrow 0.14.0. I know that this > is not the latest state of the code, but my understanding is that the > only planned change is that 0.14.1 will add the legacy types even for > local semantics. Here is what I did: > > In [1]: import pandas as pd >...: from datetime import datetime >...: from pandas import Timestamp >...: from pytz import timezone >...: ts_dt = datetime(1970, 1, 1) >...: ts_pd = Timestamp(year=1970, month=1, day=1) >...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1, > tz=timezone("Europe/Paris")) >...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1, > hour=2, tz=timezone("Europe/Helsinki")) >...: df = pd.DataFrame({ >...: 'datetime': [ts_dt, None], >...: 'pd_no_tz': [ts_pd, None], >...: 'pd_paris': [ts_pd_paris, None], >...: 'pd_helsinki': [ts_pd_helsinki, None], >...: 'pd_mixed': [ts_pd_paris, ts_pd_helsinki] >...: }) >...: df.to_parquet('test.parquet') >...: df > Out[1]: > datetime pd_no_tz pd_paris > pd_helsinki pd_mixed > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 > 02:00:00+02:00 1970-01-01 01:00:00+01:00 > 1NaTNaT NaT > NaT 1970-01-01 02:00:00+02:00 > > I picked these values because I expected all of these timestamps to be > saved as the numeric value 0. Checking the metadata using > parquet-tools: > > > parquet-tools meta test.parquet > datetime:OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1 > pd_no_tz:OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1 > pd_paris:OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1 > pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1 > pd_mixed:OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1 > > This matched my expectations up until pd_mixed. 
I was surprised to see > that timestamps with mixed time zones were stored using local > semantics instead of being normalized to UTC, but if it's an > intentional choice, I'm fine with it as long as the numbers are > correct: > > > parquet-tools head test.parquet > datetime = 0 > pd_no_tz = 0 > pd_paris = 0 > pd_helsinki = 0 > pd_mixed = 360 > pd_mixed = 720 > > The numbers for the mixed column are not 0, but that is just the > result of not using UTC-normalized semantics there. The values can be > correctly interpreted by parquet-tools: > > > parquet-tools dump test.parquet > INT64 datetime > > > *** row group 1 of 1, values 1 to 2 *** > value 1: R:0 D:1 V:1970-01-01T00:00:00.000 > value 2: R:0 D:0 V: > > INT64 pd_no_tz > > > *** row group 1 of 1, values 1 to 2 *** > value 1: R:0 D:1 V:1970-01-01T00:00:00.000 > value 2: R:0 D:0 V: > > INT64 pd_paris > > > *** row group 1 of 1, values 1 to 2 *** > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+ > value 2: R:0 D:0 V: > > INT64 pd_helsinki > > > *** row group 1 of 1, values 1 to 2 *** > value 1: R:0 D:1 V:1970-01-01T00:00:00.000+ > value 2: R:0 D:0 V: > > INT64 pd_mixed > > > *** row group 1 of 1, values 1 to 2 *** > value 1: R:0 D:1 V:1970-01-01T01:00:00.000 > value 2: R:0 D:1 V:1970-01-01T02:00:00.000 > > And naturally by pandas as well: > > In [2]: df2 = pd.read_parquet('test.parquet') >...: df2 > Out[2]: > datetime pd_no_tz pd_paris > pd_helsinki pd_mixed > 0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 > 02:00:00+02:00 1970-01-01 01:00:00 > 1 NaT NaT NaT > NaT 1970-01-01 02:00:00 > > Finally, let's try an older version of parquet-tools as well: > > > parquet-tools meta test.parquet > datetime:OPTIONAL INT64 R:0 D:1 > pd_no_tz:OPTIONAL INT64 R:0 D:1 > pd_paris:OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1 > pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1 > pd_mixed:OPTIONAL INT64 R:0 D:1 > Just to confirm that in pyarrow 0.13.0 (which tracks parquet-cpp-in-development 1.6.0, prior to introduction of 
LogicalType), TIMESTAMP_* is set for all 5 columns. This is the forward compatibility that we are planning to restore. > This confirms that the legacy types are only written for > UTC-normalized timestamps in 0.14.0. > >
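The per-version writer behavior Wes confirms here can be summarized in a small sketch. The function and variable names below are illustrative, not pyarrow API; the column list comes from Zoltan's test above.

```python
from typing import Optional

def legacy_annotation(unit: str, is_adjusted_to_utc: bool,
                      restore_forward_compat: bool) -> Optional[str]:
    # pyarrow 0.14.0 behavior: legacy ConvertedType annotation only for
    # UTC-normalized columns. With restore_forward_compat=True (the planned
    # 0.14.1 behavior, matching what 0.13.0 wrote), local-semantics columns
    # are annotated too, so pre-LogicalType readers recognize every column.
    if is_adjusted_to_utc or restore_forward_compat:
        return f"TIMESTAMP_{unit}"
    return None

# The five columns from the test, as (name, isAdjustedToUTC) pairs:
columns = [("datetime", False), ("pd_no_tz", False), ("pd_paris", True),
           ("pd_helsinki", True), ("pd_mixed", False)]

# 0.14.0: only the two UTC-normalized columns carry TIMESTAMP_MILLIS...
v0_14_0 = {name: legacy_annotation("MILLIS", utc, False) for name, utc in columns}
# ...whereas with forward compatibility restored, all five do (as in 0.13.0).
v0_14_1 = {name: legacy_annotation("MILLIS", utc, True) for name, utc in columns}
```

This matches the two parquet-tools listings above: the old reader sees O:TIMESTAMP_MILLIS only on pd_paris and pd_helsinki for a 0.14.0 file.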
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Ah, that is not that unexpected to me. Pandas stores the data as object dtype (to preserve the mixed types, it is up to the user to force a certain conversion), and so for object dtype data you need to do inference to determine which arrow or parquet type to use (in contrast to the other, non-mixed columns, which in pandas use a designated datetime64 dtype). And inference can always be somewhat subjective or incomplete, and here pyarrow and fastparquet seem to make a different decision regarding this inference (where pyarrow is arguably making a wrong decision). Joris On Thu, Jul 11, 2019 at 10:04 Zoltan Ivanfi wrote: > Hi Joris, > > Out of curiosity I tried it with fastparquet as well and that couldn't > even save that column: > > ValueError: Can't infer object conversion type: 0 1970-01-01 01:00:00+01:00 > 1 1970-01-01 02:00:00+02:00 > Name: pd_mixed, dtype: object > > Br, > > Zoltan > > On Thu, Jul 11, 2019 at 3:55 PM Joris Van den Bossche > wrote: > > > > Created an issue for the mixed timezones here: > > https://issues.apache.org/jira/browse/ARROW-5912 > > > > On Thu, Jul 11, 2019 at 09:42 Joris Van den Bossche < > > jorisvandenboss...@gmail.com> wrote: > > > > > Clarification regarding the mixed types (this is in the end not really > > > related to parquet, but to how pandas gets converted to pyarrow) > > > > > > On Thu, Jul 11, 2019 at 09:17 Zoltan Ivanfi > > > > wrote: > > > > > >> ... > > >> This matched my expectations up until pd_mixed. 
I was surprised to see > > >> that timestamps with mixed time zones were stored using local > > >> semantics instead of being normalized to UTC, > > >> > > > > > > For the actual parquet writing semantics, it is more relevant to look > at > > > the arrow Table that gets created from this DataFrame: > > > > > > In [20]: pa.Table.from_pandas(df) > > > Out[20]: > > > pyarrow.Table > > > datetime: timestamp[ns] > > > pd_no_tz: timestamp[ns] > > > pd_paris: timestamp[ns, tz=Europe/Paris] > > > pd_helsinki: timestamp[ns, tz=Europe/Helsinki] > > > pd_mixed: timestamp[us] > > > > > > For all columns except for pd_mixed the result is clear and expected, > but > > > apparently pyarrow converts the mixed timestamps to a TimestampArray > > > without timezone using the "local times", and not the UTC normalized > times. > > > > > > Now, that certainly feels a bit buggy to me (or at least unexpected). > But, > > > this is an issue for the python -> arrow conversion, not related to the > > > actual parquet writing. I will open a separate JIRA for this. > > > > > > Joris > > > >
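Joris's point about object dtype can be reproduced without Parquet at all. A minimal pandas snippet (using fixed UTC offsets in place of the named Paris and Helsinki zones, to avoid a tz-database dependency) shows why the writer must fall back to inference for the mixed column:

```python
import pandas as pd
from datetime import datetime, timedelta, timezone

# Fixed-offset zones stand in for Europe/Paris (UTC+1) and Europe/Helsinki
# (UTC+2) as of the 1970 epoch.
paris = pd.Timestamp(datetime(1970, 1, 1, 1, tzinfo=timezone(timedelta(hours=1))))
helsinki = pd.Timestamp(datetime(1970, 1, 1, 2, tzinfo=timezone(timedelta(hours=2))))

# A single-zone column gets a dedicated tz-aware datetime64 dtype...
single = pd.Series([paris, paris])
assert str(single.dtype).startswith("datetime64[ns,")

# ...but mixing zones forces pandas back to object dtype, so any writer
# (pyarrow, fastparquet) has to infer a type from the raw objects.
mixed = pd.Series([paris, helsinki])
assert mixed.dtype == object
```

This is why the two libraries can legitimately diverge: the dtype no longer pins down the conversion, and each library's inference makes a different call.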
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Joris, Out of curiosity I tried it with fastparquet as well and that couldn't even save that column: ValueError: Can't infer object conversion type: 0 1970-01-01 01:00:00+01:00 1 1970-01-01 02:00:00+02:00 Name: pd_mixed, dtype: object Br, Zoltan On Thu, Jul 11, 2019 at 3:55 PM Joris Van den Bossche wrote: > > Created an issue for the mixed timezones here: > https://issues.apache.org/jira/browse/ARROW-5912 > > On Thu, Jul 11, 2019 at 09:42 Joris Van den Bossche < > jorisvandenboss...@gmail.com> wrote: > > > Clarification regarding the mixed types (this is in the end not really > > related to parquet, but to how pandas gets converted to pyarrow) > > > > On Thu, Jul 11, 2019 at 09:17 Zoltan Ivanfi > > wrote: > > > >> ... > >> This matched my expectations up until pd_mixed. I was surprised to see > >> that timestamps with mixed time zones were stored using local > >> semantics instead of being normalized to UTC, > >> > > > > For the actual parquet writing semantics, it is more relevant to look at > > the arrow Table that gets created from this DataFrame: > > > > In [20]: pa.Table.from_pandas(df) > > Out[20]: > > pyarrow.Table > > datetime: timestamp[ns] > > pd_no_tz: timestamp[ns] > > pd_paris: timestamp[ns, tz=Europe/Paris] > > pd_helsinki: timestamp[ns, tz=Europe/Helsinki] > > pd_mixed: timestamp[us] > > > > For all columns except for pd_mixed the result is clear and expected, but > > apparently pyarrow converts the mixed timestamps to a TimestampArray > > without timezone using the "local times", and not the UTC normalized times. > > > > Now, that certainly feels a bit buggy to me (or at least unexpected). But, > > this is an issue for the python -> arrow conversion, not related to the > > actual parquet writing. I will open a separate JIRA for this. > > > > Joris > >
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Clarification regarding the mixed types (this is in the end not really related to parquet, but to how pandas gets converted to pyarrow).

On Thu, Jul 11, 2019 at 09:17 Zoltan Ivanfi wrote:
> ...
> This matched my expectations up until pd_mixed. I was surprised to see
> that timestamps with mixed time zones were stored using local
> semantics instead of being normalized to UTC,
>

For the actual parquet writing semantics, it is more relevant to look at the arrow Table that gets created from this DataFrame:

In [20]: pa.Table.from_pandas(df)
Out[20]:
pyarrow.Table
datetime: timestamp[ns]
pd_no_tz: timestamp[ns]
pd_paris: timestamp[ns, tz=Europe/Paris]
pd_helsinki: timestamp[ns, tz=Europe/Helsinki]
pd_mixed: timestamp[us]

For all columns except for pd_mixed the result is clear and expected, but apparently pyarrow converts the mixed timestamps to a TimestampArray without timezone using the "local times", and not the UTC normalized times.

Now, that certainly feels a bit buggy to me (or at least unexpected). But, this is an issue for the python -> arrow conversion, not related to the actual parquet writing. I will open a separate JIRA for this.

Joris
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Tim,

I was just told that arrow 0.14.0 is already out and uses the new logical types. This means that your original suggestion of changing the boolean to an enum is not really a viable option. Regarding your second suggestion, an additional field may be added to parquet-format, but I'm still concerned that it would be confusing and that it would provide little value in practice. However, thinking more about the problem I realized that we do not have to add anything to parquet-format to record the uncertainty of timestamp semantics, since we already have a way of doing so by using the legacy timestamp types. It is enough to expose the possibility of having undefined semantics in the API. This can be deduced from the logical types stored in the file in the following way:

Logical type in file => Logical type in API
---
new TS type with semantics A => TS with semantics A
new TS type with semantics B => TS with semantics B
legacy TS type => TS with unknown semantics

For example, TIMESTAMP_MICROS would map to unknown semantics in the API while TIMESTAMP(isAdjustedToUTC=true, unit=MICROS) would map to instant semantics. Please let me know your thoughts on this suggestion.

Thanks,

Zoltan

On Wed, Jul 10, 2019 at 6:43 PM TP Boudreau wrote:
>
> Sorry for the quick self-reply, but after brief reflection I think two
> changes to my alternative proposal are required:
>
> 1. The proposed new field should be a parameter to the TimestampType, not
> FileMetaData -- file level adds unnecessary complication / opportunities
> for mischief.
> 2. Although reported vs. inferred is the logical distinction, practically
> this change is about whether or not the TimestampType was built from a
> TIMESTAMP converted type, so the name should reflect that.
>
> After these amendments, this option boils down to: add a new boolean
> parameter to the TimestampType named (something like) fromConvertedType.
>
> --TPB
>
> On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau wrote:
> >
> > Hi Zoltan,
> >
> > Thank you for the helpful clarification of the community's understanding
> > of the TIMESTAMP annotation.
> >
> > The core of the problem (IMHO) is that there is no way to distinguish in the
> > new LogicalType TimestampType between the case where UTC-normalization has
> > been directly reported (via a user supplied TimestampType boolean
> > parameter) or merely inferred (from a naked TIMESTAMP converted type). So
> > perhaps another alternative might be to retain the isAdjustedToUTC boolean
> > as is and add another field that indicates whether the adjusted flag was
> > REPORTED or INFERRED (could be boolean or short variant or some type).
> > This would allow interested readers to differentiate between old and new
> > timestamps, while allowing other readers to enjoy the default you believe
> > is warranted. It seems most straightforward that it would be an additional
> > parameter on the TimestampType, but I suppose it could reside in the
> > FileMetaData struct (on the assumption that the schema elements, having
> > been written by the same writer, all uniformly use converted type or
> > LogicalType).
> >
> > --Tim
> >
> >
> > On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi
> > wrote:
> >
> >> Hi Tim,
> >>
> >> In my opinion the specification of the older timestamp types only allowed
> >> UTC-normalized storage, since these types were defined as the number of
> >> milli/microseconds elapsed since the Unix epoch. This clearly defines the
> >> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e.
> >> 1970-01-01 00:00:00 UTC. It does not say anything about how this value must
> >> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but
> >> typically it is displayed adjusted to the user's local timezone, for
> >> example "1970-01-01 01:00:00" for a user in Paris.
> >> I don't think this
> >> definition allows interpreting the numeric value 0 as "1970-01-01 00:00:00"
> >> in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC,
> >> which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or
> >> 10^6 for _MICROS) instead.
> >>
> >> I realize that compatibility with real-life usage patterns is important
> >> regardless of whether they comply with the specification or not, but I
> >> can't think of any solution that would be useful in practice. The
> >> suggestion to turn the boolean into an enum would certainly allow Parquet
> >> to have timestamps with unknown semantics, but I don't know what value
> >> that would bring to applications and how they would use it. I'm also
> >> afraid that the undefined semantics would get misused/overused by
> >> developers who are not sure about the difference between the two
> >> semantics and we would end up with a lot of meaningless timestamps.
> >>
> >> Even with the problems I listed, your suggestion may still be better than
> >> the
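Zoltan's proposed file-to-API mapping in the message above might look like the following sketch. The `Semantics` enum and the argument names are hypothetical illustrations, not part of parquet-format or any existing reader API.

```python
from enum import Enum
from typing import Optional

class Semantics(Enum):
    INSTANT = "instant"  # new type with isAdjustedToUTC=true
    LOCAL = "local"      # new type with isAdjustedToUTC=false
    UNKNOWN = "unknown"  # only a legacy TIMESTAMP_MILLIS/MICROS annotation

def api_semantics(logical_is_adjusted_to_utc: Optional[bool] = None,
                  legacy_converted_type: Optional[str] = None) -> Semantics:
    """Map what is stored in the file to the semantics exposed by the API,
    per the table in Zoltan's proposal."""
    if logical_is_adjusted_to_utc is not None:
        # A new logical type states its semantics explicitly.
        return Semantics.INSTANT if logical_is_adjusted_to_utc else Semantics.LOCAL
    if legacy_converted_type is not None:
        # Old writers never recorded the semantics, so nothing is claimed.
        return Semantics.UNKNOWN
    raise ValueError("not a timestamp column")
```

The appeal of this scheme is that no new field is needed in the file format: the presence of only a legacy annotation already encodes the uncertainty.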
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Wes,

I did a little bit of testing using pyarrow 0.14.0. I know that this is not the latest state of the code, but my understanding is that the only planned change is that 0.14.1 will add the legacy types even for local semantics. Here is what I did:

In [1]: import pandas as pd
   ...: from datetime import datetime
   ...: from pandas import Timestamp
   ...: from pytz import timezone
   ...: ts_dt = datetime(1970, 1, 1)
   ...: ts_pd = Timestamp(year=1970, month=1, day=1)
   ...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1, tz=timezone("Europe/Paris"))
   ...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1, hour=2, tz=timezone("Europe/Helsinki"))
   ...: df = pd.DataFrame({
   ...:     'datetime': [ts_dt, None],
   ...:     'pd_no_tz': [ts_pd, None],
   ...:     'pd_paris': [ts_pd_paris, None],
   ...:     'pd_helsinki': [ts_pd_helsinki, None],
   ...:     'pd_mixed': [ts_pd_paris, ts_pd_helsinki]
   ...: })
   ...: df.to_parquet('test.parquet')
   ...: df
Out[1]:
    datetime   pd_no_tz                  pd_paris               pd_helsinki                  pd_mixed
0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00+01:00
1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00+02:00

I picked these values because I expected all of these timestamps to be saved as the numeric value 0. Checking the metadata using parquet-tools:

> parquet-tools meta test.parquet
datetime:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
pd_no_tz:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
pd_paris:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
pd_mixed:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1

This matched my expectations up until pd_mixed.
I was surprised to see that timestamps with mixed time zones were stored using local semantics instead of being normalized to UTC, but if it's an intentional choice, I'm fine with it as long as the numbers are correct:

> parquet-tools head test.parquet
datetime = 0
pd_no_tz = 0
pd_paris = 0
pd_helsinki = 0
pd_mixed = 360
pd_mixed = 720

The numbers for the mixed column are not 0, but that is just the result of not using UTC-normalized semantics there. The values can be correctly interpreted by parquet-tools:

> parquet-tools dump test.parquet
INT64 datetime

*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:1 V:1970-01-01T00:00:00.000
value 2: R:0 D:0 V:

INT64 pd_no_tz

*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:1 V:1970-01-01T00:00:00.000
value 2: R:0 D:0 V:

INT64 pd_paris

*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:1 V:1970-01-01T00:00:00.000+
value 2: R:0 D:0 V:

INT64 pd_helsinki

*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:1 V:1970-01-01T00:00:00.000+
value 2: R:0 D:0 V:

INT64 pd_mixed

*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:1 V:1970-01-01T01:00:00.000
value 2: R:0 D:1 V:1970-01-01T02:00:00.000

And naturally by pandas as well:

In [2]: df2 = pd.read_parquet('test.parquet')
   ...: df2
Out[2]:
    datetime   pd_no_tz                  pd_paris               pd_helsinki            pd_mixed
0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00
1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00

Finally, let's try an older version of parquet-tools as well:

> parquet-tools meta test.parquet
datetime:    OPTIONAL INT64 R:0 D:1
pd_no_tz:    OPTIONAL INT64 R:0 D:1
pd_paris:    OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
pd_mixed:    OPTIONAL INT64 R:0 D:1

This confirms that the legacy types are only written for UTC-normalized timestamps in 0.14.0.
In summary, when saving two timestamps from different time zones that refer to the same instant, I would have expected pyarrow to normalize to UTC and thereby sacrifice the local representations, instead of saving using local semantics and thereby sacrificing the instants. I don't know whether that is the intended behaviour, but in any case, based on this short manual testing, the new timestamp types written by pyarrow are interoperable with the Java library.

Br,

Zoltan

On Wed, Jul 10, 2019 at 4:30 PM Wes McKinney
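The trade-off Zoltan summarizes (sacrificing local representations vs. sacrificing instants) can be checked with plain datetime arithmetic. Fixed offsets stand in here for the Paris and Helsinki zones at the epoch:

```python
from datetime import datetime, timedelta, timezone

# Europe/Paris was UTC+1 and Europe/Helsinki UTC+2 on 1970-01-01.
paris = datetime(1970, 1, 1, 1, tzinfo=timezone(timedelta(hours=1)))
helsinki = datetime(1970, 1, 1, 2, tzinfo=timezone(timedelta(hours=2)))

# Both values denote the same instant, the Unix epoch, so UTC-normalized
# (instant semantics) storage would write 0 for both rows...
assert paris == helsinki
assert paris.timestamp() == 0.0 and helsinki.timestamp() == 0.0

# ...while local semantics keeps the differing wall-clock readings, 01:00
# and 02:00, i.e. 3,600,000 and 7,200,000 in a TIMESTAMP(MILLIS,false)
# column. (Interpreting the naive wall time as if it were UTC computes
# exactly the stored local-semantics value.)
millis_local = [int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)
                for dt in (paris, helsinki)]
```

So each choice loses information: normalizing discards the two distinct local representations, while local semantics discards the fact that both rows were the same instant.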
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Sorry for the quick self-reply, but after brief reflection I think two changes to my alternative proposal are required:

1. The proposed new field should be a parameter to the TimestampType, not FileMetaData -- file level adds unnecessary complication / opportunities for mischief.
2. Although reported vs. inferred is the logical distinction, practically this change is about whether or not the TimestampType was built from a TIMESTAMP converted type, so the name should reflect that.

After these amendments, this option boils down to: add a new boolean parameter to the TimestampType named (something like) fromConvertedType.

--TPB

On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau wrote:
> Hi Zoltan,
>
> Thank you for the helpful clarification of the community's understanding
> of the TIMESTAMP annotation.
>
> The core of the problem (IMHO) is that there is no way to distinguish in the
> new LogicalType TimestampType between the case where UTC-normalization has
> been directly reported (via a user supplied TimestampType boolean
> parameter) or merely inferred (from a naked TIMESTAMP converted type). So
> perhaps another alternative might be to retain the isAdjustedToUTC boolean
> as is and add another field that indicates whether the adjusted flag was
> REPORTED or INFERRED (could be boolean or short variant or some type).
> This would allow interested readers to differentiate between old and new
> timestamps, while allowing other readers to enjoy the default you believe
> is warranted. It seems most straightforward that it would be an additional
> parameter on the TimestampType, but I suppose it could reside in the
> FileMetaData struct (on the assumption that the schema elements, having
> been written by the same writer, all uniformly use converted type or
> LogicalType).
> > --Tim > > > On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi > wrote: > >> Hi Tim, >> >> In my opinion the specification of the older timestamp types only allowed >> UTC-normalized storage, since these types were defined as the number of >> milli/microseconds elapsed since the Unix epoch. This clearly defines the >> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. >> 1970-01-01 00:00:00 UTC. It does not say anything about how this value >> must >> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but >> typically it is displayed adjusted to the user's local timezone, for >> example "1970-01-01 01:00:00" for a user in Paris. I don't think this >> definition allows interpreting the numeric value 0 as "1970-01-01 >> 00:00:00" >> in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC, >> which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or >> 10^6 for _MICROS) instead. >> >> I realize that compatibility with real-life usage patterns is important >> regardless of whether they comply with the specification or not, but I >> can't think of any solution that would be useful in practice. The >> suggestion to turn the boolean into an enum would certainly allow Parquet >> to have timestamps with unknown semantics, but I don't know what value >> that >> would bring to applications and how they would use it. I'm also afraid >> that >> the undefined semantics would get misused/overused by developers who are >> not sure about the difference between the two semantics and we would end >> up >> with a lot of meaningless timestamps. >> >> Even with the problems I listed your suggestion may still be better than >> the current solution, but before making a community decision I would like >> to continue this discussion focusing on three questions: >> >>- What are the implications of this change? >>- How will unknown semantics be used in practice? >>- Does it bring value? >>- Can we do better? 
>>- Can we even change the boolean to an enum? It has been specified like >>that and released a long time ago. Although I am not aware of any >> software >>component that would have already implemented it, I was also unaware of >>software components using TIMESTAMP_MILLIS and _MICROS for local >> semantics. >> >> One alternative that comes to my mind is to default to the more common >> UTC-normalized semantics but allow overriding it in the reader schema. >> >> Thanks, >> >> Zoltan >> >> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau wrote: >> >> > I'm not a long-time Parquet user, but I assisted in the expansion of the >> > parquet-cpp library's LogicalType facility. >> > >> > My impression is that the original TIMESTAMP converted types were >> silent on >> > whether the annotated value was UTC adjusted and that (often arcane) >> > out-of-band information had to be relied on by readers to decide the UTC >> > adjustment status for timestamp columns. It seemed to me that that >> > perceived shortcoming was a primary motivator for adding the >> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp >> > LogicalType. If that impression is accurate,
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Zoltan, Thank you for the helpful clarification of the community's understanding of the TIMESTAMP annotation. The core of the problem (IMHO) is that there is no way to distinguish in the new LogicalType TimestampType between the case where UTC-normalization has been directly reported (via a user-supplied TimestampType boolean parameter) and the case where it has merely been inferred (from a naked TIMESTAMP converted type). So perhaps another alternative might be to retain the isAdjustedToUTC boolean as is and add another field that indicates whether the adjusted flag was REPORTED or INFERRED (could be boolean or short variant or some type). This would allow interested readers to differentiate between old and new timestamps, while allowing other readers to enjoy the default you believe is warranted. It seems most straightforward that it would be an additional parameter on the TimestampType, but I suppose it could reside in the FileMetaData struct (on the assumption that the schema elements, having been written by the same writer, all uniformly use converted type or LogicalType). --Tim On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi wrote: > Hi Tim, > > In my opinion the specification of the older timestamp types only allowed > UTC-normalized storage, since these types were defined as the number of > milli/microseconds elapsed since the Unix epoch. This clearly defines the > meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. > 1970-01-01 00:00:00 UTC. It does not say anything about how this value must > be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but > typically it is displayed adjusted to the user's local timezone, for > example "1970-01-01 01:00:00" for a user in Paris. I don't think this > definition allows interpreting the numeric value 0 as "1970-01-01 00:00:00" > in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC, > which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or > 10^6 for _MICROS) instead. 
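Tim's proposal (keep the isAdjustedToUTC boolean and record whether it was reported or inferred) can be sketched roughly as follows. This is an illustrative Python model only; the class, field, and function names are hypothetical and are not the actual parquet-format, parquet-mr, or parquet-cpp API:

```python
from dataclasses import dataclass
from enum import Enum

class TimeUnit(Enum):
    MILLIS = 1
    MICROS = 2

@dataclass(frozen=True)
class TimestampType:
    unit: TimeUnit
    is_adjusted_to_utc: bool
    # True when the UTC flag was inferred from a legacy TIMESTAMP_MILLIS /
    # TIMESTAMP_MICROS annotation rather than reported by the writer.
    from_converted_type: bool = False

def from_legacy(converted: str) -> TimestampType:
    # Per the spec's backward-compatibility rule, a bare converted type is
    # read as UTC-normalized, but the inference is recorded in the new flag
    # so interested readers can override the default.
    unit = TimeUnit.MILLIS if converted == "TIMESTAMP_MILLIS" else TimeUnit.MICROS
    return TimestampType(unit, is_adjusted_to_utc=True, from_converted_type=True)

t = from_legacy("TIMESTAMP_MICROS")
print(t.is_adjusted_to_utc, t.from_converted_type)  # True True
```

A reader that cares about the distinction checks the extra flag; one that does not simply sees the spec-mandated default.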
> > I realize that compatibility with real-life usage patterns is important > regardless of whether they comply with the specification or not, but I > can't think of any solution that would be useful in practice. The > suggestion to turn the boolean into an enum would certainly allow Parquet > to have timestamps with unknown semantics, but I don't know what value that > would bring to applications and how they would use it. I'm also afraid that > the undefined semantics would get misused/overused by developers who are > not sure about the difference between the two semantics and we would end up > with a lot of meaningless timestamps. > > Even with the problems I listed your suggestion may still be better than > the current solution, but before making a community decision I would like > to continue this discussion focusing on three questions: > >- What are the implications of this change? >- How will unknown semantics be used in practice? >- Does it bring value? >- Can we do better? >- Can we even change the boolean to an enum? It has been specified like >that and released a long time ago. Although I am not aware of any > software >component that would have already implemented it, I was also unaware of >software components using TIMESTAMP_MILLIS and _MICROS for local > semantics. > > One alternative that comes to my mind is to default to the more common > UTC-normalized semantics but allow overriding it in the reader schema. > > Thanks, > > Zoltan > > On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau wrote: > > > I'm not a long-time Parquet user, but I assisted in the expansion of the > > parquet-cpp library's LogicalType facility. > > > > My impression is that the original TIMESTAMP converted types were silent > on > > whether the annotated value was UTC adjusted and that (often arcane) > > out-of-band information had to be relied on by readers to decide the UTC > > adjustment status for timestamp columns. 
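The enum suggestion Zoltan discusses above (three states instead of a boolean, with legacy annotations surfacing as unknown semantics in the API) could look roughly like this. The `UTCAdjustment` name echoes the proposal elsewhere in this thread; the dict-based annotation model and function name are purely illustrative:

```python
from enum import Enum

class UTCAdjustment(Enum):
    UNKNOWN = 0           # only a legacy TIMESTAMP_MILLIS/MICROS annotation
    UTC_ADJUSTED = 1      # new type, isAdjustedToUTC=true (instant semantics)
    NOT_UTC_ADJUSTED = 2  # new type, isAdjustedToUTC=false (local semantics)

def semantics_of(annotation: dict) -> UTCAdjustment:
    # Map what is stored in the file to the semantics exposed by the API.
    if annotation.get("logical") == "TIMESTAMP":
        if annotation["isAdjustedToUTC"]:
            return UTCAdjustment.UTC_ADJUSTED
        return UTCAdjustment.NOT_UTC_ADJUSTED
    return UTCAdjustment.UNKNOWN  # the writer's intent was never recorded

print(semantics_of({"converted": "TIMESTAMP_MICROS"}).name)
print(semantics_of({"logical": "TIMESTAMP", "isAdjustedToUTC": True}).name)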
It seemed to me that that > > perceived shortcoming was a primary motivator for adding the > > isAdjustedToUTC boolean parameter to the corresponding new Timestamp > > LogicalType. If that impression is accurate, then when reading TIMESTAMP > > columns written by legacy (converted type only) writers, it seems > > inappropriate for LogicalType aware readers to unconditionally assign > > *either* "false" or "true" (as currently required) to a boolean UTC > > adjusted parameter, as that requires the reader to infer a property that > > wasn't implied by the writer. > > > > One possible approach to untangling this might be to amend the > > parquet.thrift specification to change the isAdjustedToUTC boolean > property > > to an enum or union type (some enumerated list) named (for example) > > UTCAdjustment with three possible values: Unknown, UTCAdjusted, > > NotUTCAdjusted (I'm not married to the names). Extant files with > TIMESTAMP > > converted types only would map for forward compatibility to Timestamp > >
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Correct On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi wrote: > > Hi Wes, > > Do you mean that the new logical types have already been released in 0.14.0 > and a 0.14.1 is needed ASAP to fix this regression? > > Thanks, > > Zoltan > > On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney wrote: > > > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these > > issues and others we need to make a new release within the next 7-10 > > days. We can point you to nightly Python builds to make testing for > > you easier so you don't have to build the project yourself. > > > > - Wes > > > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi > > wrote: > > > > > > Hi, > > > > > > Oh, and one more thing: Before releasing the next Arrow version > > > incorporating the new logical types, we should definitely test that their > > > behaviour matches that of parquet-mr. When is the next release planned to > > > come out? > > > > > > Br, > > > > > > Zoltan > > > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi wrote: > > > > > > > Hi Wes, > > > > > > > > Yes, I agree that we should do that, but then we have a problem of > > what to > > > > do in the other direction, i.e. when we use the new logical types API > > to > > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC > > > > normalized flag? Tim has started a discussion about that, suggesting > > to use > > > > three states that I just answered. > > > > > > > > Br, > > > > > > > > Zoltan > > > > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney > > wrote: > > > > > > > >> Thanks for the comments. > > > >> > > > >> So in summary I think that we need to set the TIMESTAMP_* converted > > > >> types to maintain forward compatibility and stay consistent with what > > > >> we were doing in the C++ library prior to the introduction of the > > > >> LogicalType metadata. 
> > > >> > > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi > > > >> > > > > > >> wrote: > > > >> > > > > >> > Hi Wes, > > > >> > > > > >> > Both of the semantics are deterministic in one aspect and > > > >> indeterministic > > > >> > in another. Timestamps of instant semantics will always refer to the > > same > > > >> > instant, but their user-facing representation (how they get > > displayed) > > > >> > depends on the user's time zone. Timestamps of local semantics > > always > > > >> have > > > >> > the same user-facing representation but the instant they refer to is > > > >> > undefined (or ambiguous, depending on your point of view). > > > >> > > > > >> > My understanding is that Spark uses instant semantics, i.e., > > timestamps > > > >> are > > > >> > stored normalized to UTC and are displayed adjusted to the user's > > local > > > >> > time zone. > > > >> > > > > >> > Br, > > > >> > > > > >> > Zoltan > > > >> > > > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney > > > >> wrote: > > > >> > > > > >> > > Thanks Zoltan. > > > >> > > > > > >> > > This is definitely a tricky issue. > > > >> > > > > > >> > > Spark's application of localtime semantics to timestamp data has > > been > > > >> > > a source of issues for many people. Personally I don't find that > > > >> > > behavior to be particularly helpful since depending on the session > > > >> > > time zone, you will get different results on data not marked as > > > >> > > UTC-normalized. > > > >> > > > > > >> > > In pandas, in contrast, we apply UTC semantics to > > > >> > > naive/not-explicitly-normalized data so at least the code produces > > > >> > > deterministic results on all environments. That seems strictly > > better > > > >> > > to me -- if you want a localized interpretation of naive data, you > > > >> > > must opt into this rather than having it automatically selected > > based > > > >> > > on your locale. 
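Wes's determinism point can be illustrated with a short sketch using only the standard library. Interpreting the same naive wall-clock value under different session time zones yields different stored instants, while a fixed UTC interpretation gives one answer everywhere (the fixed-offset zones below are chosen purely for illustration):

```python
from datetime import datetime, timezone, timedelta

naive = datetime(2019, 7, 9, 12, 0, 0)  # wall-clock value, no zone recorded

# UTC semantics: one deterministic answer in every environment.
as_utc = naive.replace(tzinfo=timezone.utc).timestamp()

# Session-local semantics: the stored instant shifts with the locale.
as_paris = naive.replace(tzinfo=timezone(timedelta(hours=2))).timestamp()
as_new_york = naive.replace(tzinfo=timezone(timedelta(hours=-4))).timestamp()

print(as_utc - as_paris)     # 7200.0: two readers disagree by the offset
print(as_new_york - as_utc)  # 14400.0
```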
The instances of people shooting their toes off > > due to > > > >> > > time zones are practically non-existent, whereas I'm hearing about > > > >> > > Spark gotchas all the time. > > > >> > > > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > > > > >> > > > > >> > > wrote: > > > >> > > > > > > >> > > > Hi Wes, > > > >> > > > > > > >> > > > The rules for TIMESTAMP forward-compatibility were created > > based on > > > >> the > > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only > > > >> been used > > > >> > > > in the instant aka. UTC-normalized semantics so far. This > > > >> assumption was > > > >> > > > supported by two sources: > > > >> > > > > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS > > and > > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed > > since > > > >> the > > > >> > > Unix > > > >> > > > epoch, an instant specified in UTC, from which it follows that > > they > > > >> have > > > >> > > > instant semantics (because timestamps of local semantics do not > > > >> > > correspond > > > >> > > > to a single instant). > > > >> > > > > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Wes, Do you mean that the new logical types have already been released in 0.14.0 and a 0.14.1 is needed ASAP to fix this regression? Thanks, Zoltan On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney wrote: > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these > issues and others we need to make a new release within the next 7-10 > days. We can point you to nightly Python builds to make testing for > you easier so you don't have to build the project yourself. > > - Wes > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi > wrote: > > > > Hi, > > > > Oh, and one more thing: Before releasing the next Arrow version > > incorporating the new logical types, we should definitely test that their > > behaviour matches that of parquet-mr. When is the next release planned to > > come out? > > > > Br, > > > > Zoltan > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi wrote: > > > > > Hi Wes, > > > > > > Yes, I agree that we should do that, but then we have a problem of > what to > > > do in the other direction, i.e. when we use the new logical types API > to > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC > > > normalized flag? Tim has started a discussion about that, suggesting > to use > > > three states that I just answered. > > > > > > Br, > > > > > > Zoltan > > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney > wrote: > > > > > >> Thanks for the comments. > > >> > > >> So in summary I think that we need to set the TIMESTAMP_* converted > > >> types to maintain forward compatibility and stay consistent with what > > >> we were doing in the C++ library prior to the introduction of the > > >> LogicalType metadata. > > >> > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi > > > >> wrote: > > >> > > > >> > Hi Wes, > > >> > > > >> > Both of the semantics are deterministic in one aspect and > > >> indeterministic > > >> > in another. 
Timestamps of instant semantics will always refer to the > same > > >> > instant, but their user-facing representation (how they get > displayed) > > >> > depends on the user's time zone. Timestamps of local semantics > always > > >> have > > >> > the same user-facing representation but the instant they refer to is > > >> > undefined (or ambiguous, depending on your point of view). > > >> > > > >> > My understanding is that Spark uses instant semantics, i.e., > timestamps > > >> are > > >> > stored normalized to UTC and are displayed adjusted to the user's > local > > >> > time zone. > > >> > > > >> > Br, > > >> > > > >> > Zoltan > > >> > > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney > > >> wrote: > > >> > > > >> > > Thanks Zoltan. > > >> > > > > >> > > This is definitely a tricky issue. > > >> > > > > >> > > Spark's application of localtime semantics to timestamp data has > been > > >> > > a source of issues for many people. Personally I don't find that > > >> > > behavior to be particularly helpful since depending on the session > > >> > > time zone, you will get different results on data not marked as > > >> > > UTC-normalized. > > >> > > > > >> > > In pandas, in contrast, we apply UTC semantics to > > >> > > naive/not-explicitly-normalized data so at least the code produces > > >> > > deterministic results on all environments. That seems strictly > better > > >> > > to me -- if you want a localized interpretation of naive data, you > > >> > > must opt into this rather than having it automatically selected > based > > >> > > on your locale. The instances of people shooting their toes off > due to > > >> > > time zones are practically non-existent, whereas I'm hearing about > > >> > > Spark gotchas all the time. 
> > >> > > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > > >> > > > >> > > wrote: > > >> > > > > > >> > > > Hi Wes, > > >> > > > > > >> > > > The rules for TIMESTAMP forward-compatibility were created > based on > > >> the > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only > > >> been used > > >> > > > in the instant aka. UTC-normalized semantics so far. This > > >> assumption was > > >> > > > supported by two sources: > > >> > > > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS > and > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed > since > > >> the > > >> > > Unix > > >> > > > epoch, an instant specified in UTC, from which it follows that > they > > >> have > > >> > > > instant semantics (because timestamps of local semantics do not > > >> > > correspond > > >> > > > to a single instant). > > >> > > > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software > component > > >> that > > >> > > > used these types differently from the specification. > > >> > > > > > >> > > > Based on your e-mail, we were wrong on #2. > > >> > > > > > >> > > > From this false premise it followed that TIMESTAMPs with local > > >> semantics > > >> > > > were a new type and did not need to be annotated with the old > types > > >> to
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
hi Zoltan -- given the raging fire that is 0.14.0 as a result of these issues and others we need to make a new release within the next 7-10 days. We can point you to nightly Python builds to make testing for you easier so you don't have to build the project yourself. - Wes On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi wrote: > > Hi, > > Oh, and one more thing: Before releasing the next Arrow version > incorporating the new logical types, we should definitely test that their > behaviour matches that of parquet-mr. When is the next release planned to > come out? > > Br, > > Zoltan > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi wrote: > > > Hi Wes, > > > > Yes, I agree that we should do that, but then we have a problem of what to > > do in the other direction, i.e. when we use the new logical types API to > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC > > normalized flag? Tim has started a discussion about that, suggesting to use > > three states that I just answered. > > > > Br, > > > > Zoltan > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney wrote: > > > >> Thanks for the comments. > >> > >> So in summary I think that we need to set the TIMESTAMP_* converted > >> types to maintain forward compatibility and stay consistent with what > >> we were doing in the C++ library prior to the introduction of the > >> LogicalType metadata. > >> > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi > >> wrote: > >> > > >> > Hi Wes, > >> > > >> > Both of the semantics are deterministic in one aspect and > >> indeterministic > >> > in another. Timestamps of instant semantics will always refer to the same > >> > instant, but their user-facing representation (how they get displayed) > >> > depends on the user's time zone. Timestamps of local semantics always > >> have > >> > the same user-facing representation but the instant they refer to is > >> > undefined (or ambiguous, depending on your point of view). 
> >> > > >> > My understanding is that Spark uses instant semantics, i.e., timestamps > >> are > >> > stored normalized to UTC and are displayed adjusted to the user's local > >> > time zone. > >> > > >> > Br, > >> > > >> > Zoltan > >> > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney > >> wrote: > >> > > >> > > Thanks Zoltan. > >> > > > >> > > This is definitely a tricky issue. > >> > > > >> > > Spark's application of localtime semantics to timestamp data has been > >> > > a source of issues for many people. Personally I don't find that > >> > > behavior to be particularly helpful since depending on the session > >> > > time zone, you will get different results on data not marked as > >> > > UTC-normalized. > >> > > > >> > > In pandas, in contrast, we apply UTC semantics to > >> > > naive/not-explicitly-normalized data so at least the code produces > >> > > deterministic results on all environments. That seems strictly better > >> > > to me -- if you want a localized interpretation of naive data, you > >> > > must opt into this rather than having it automatically selected based > >> > > on your locale. The instances of people shooting their toes off due to > >> > > time zones are practically non-existent, whereas I'm hearing about > >> > > Spark gotchas all the time. > >> > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > >> > > >> > > >> > > wrote: > >> > > > > >> > > > Hi Wes, > >> > > > > >> > > > The rules for TIMESTAMP forward-compatibility were created based on > >> the > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only > >> been used > >> > > > in the instant aka. UTC-normalized semantics so far. This > >> assumption was > >> > > > supported by two sources: > >> > > > > >> > > > 1. 
The specification: parquet-format defined TIMESTAMP_MILLIS and > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since > >> the > >> > > Unix > >> > > > epoch, an instant specified in UTC, from which it follows that they > >> have > >> > > > instant semantics (because timestamps of local semantics do not > >> > > correspond > >> > > > to a single instant). > >> > > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software component > >> that > >> > > > used these types differently from the specification. > >> > > > > >> > > > Based on your e-mail, we were wrong on #2. > >> > > > > >> > > > From this false premise it followed that TIMESTAMPs with local > >> semantics > >> > > > were a new type and did not need to be annotated with the old types > >> to > >> > > > maintain compatibility. In fact, annotating them with the old types > >> was > >> > > > considered to be harmful, since it would have misled older readers > >> into > >> > > > thinking that they can read TIMESTAMPs with local semantics, when in > >> > > > reality they would have misinterpreted them as TIMESTAMPs with > >> instant > >> > > > semantics. This would have led to a difference of several hours, > >> > > > corresponding to the time zone offset. > >> > > > >
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi, Oh, and one more thing: Before releasing the next Arrow version incorporating the new logical types, we should definitely test that their behaviour matches that of parquet-mr. When is the next release planned to come out? Br, Zoltan On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi wrote: > Hi Wes, > > Yes, I agree that we should do that, but then we have a problem of what to > do in the other direction, i.e. when we use the new logical types API to > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC > normalized flag? Tim has started a discussion about that, suggesting to use > three states that I just answered. > > Br, > > Zoltan > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney wrote: > >> Thanks for the comments. >> >> So in summary I think that we need to set the TIMESTAMP_* converted >> types to maintain forward compatibility and stay consistent with what >> we were doing in the C++ library prior to the introduction of the >> LogicalType metadata. >> >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi >> wrote: >> > >> > Hi Wes, >> > >> > Both of the semantics are deterministic in one aspect and >> indeterministic >> > in another. Timestamps of instant semantics will always refer to the same >> > instant, but their user-facing representation (how they get displayed) >> > depends on the user's time zone. Timestamps of local semantics always >> have >> > the same user-facing representation but the instant they refer to is >> > undefined (or ambiguous, depending on your point of view). >> > >> > My understanding is that Spark uses instant semantics, i.e., timestamps >> are >> > stored normalized to UTC and are displayed adjusted to the user's local >> > time zone. >> > >> > Br, >> > >> > Zoltan >> > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney >> wrote: >> > >> > > Thanks Zoltan. >> > > >> > > This is definitely a tricky issue. 
>> > > >> > > Spark's application of localtime semantics to timestamp data has been >> > > a source of issues for many people. Personally I don't find that >> > > behavior to be particularly helpful since depending on the session >> > > time zone, you will get different results on data not marked as >> > > UTC-normalized. >> > > >> > > In pandas, in contrast, we apply UTC semantics to >> > > naive/not-explicitly-normalized data so at least the code produces >> > > deterministic results on all environments. That seems strictly better >> > > to me -- if you want a localized interpretation of naive data, you >> > > must opt into this rather than having it automatically selected based >> > > on your locale. The instances of people shooting their toes off due to >> > > time zones are practically non-existent, whereas I'm hearing about >> > > Spark gotchas all the time. >> > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > > >> > > wrote: >> > > > >> > > > Hi Wes, >> > > > >> > > > The rules for TIMESTAMP forward-compatibility were created based on >> the >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only >> been used >> > > > in the instant aka. UTC-normalized semantics so far. This >> assumption was >> > > > supported by two sources: >> > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since >> the >> > > Unix >> > > > epoch, an instant specified in UTC, from which it follows that they >> have >> > > > instant semantics (because timestamps of local semantics do not >> > > correspond >> > > > to a single instant). >> > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software component >> that >> > > > used these types differently from the specification. >> > > > >> > > > Based on your e-mail, we were wrong on #2. 
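Point 1 above can be checked numerically with a short standard-library sketch (Paris was UTC+1 on 1970-01-01, modeled here as a fixed offset): the value 0 denotes the Unix epoch itself, while the wall-clock "1970-01-01 00:00:00" in Paris is a different instant stored as -3600 seconds:

```python
from datetime import datetime, timezone, timedelta

# Per the spec, TIMESTAMP_MILLIS value 0 means the Unix epoch in UTC.
epoch_utc = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(int(epoch_utc.timestamp()) * 1000)  # 0 milliseconds

# The wall clock "1970-01-01 00:00:00" in Paris (UTC+1) corresponds to
# 1969-12-31 23:00:00 UTC, i.e. -3600 s, so -3600000 as TIMESTAMP_MILLIS.
paris = timezone(timedelta(hours=1))
local_midnight = datetime(1970, 1, 1, tzinfo=paris)
print(int(local_midnight.timestamp()) * 1000)  # -3600000 milliseconds
```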
>> > > > >> > > > From this false premise it followed that TIMESTAMPs with local >> semantics >> > > > were a new type and did not need to be annotated with the old types >> to >> > > > maintain compatibility. In fact, annotating them with the old types >> was >> > > > considered to be harmful, since it would have misled older readers >> into >> > > > thinking that they can read TIMESTAMPs with local semantics, when in >> > > > reality they would have misinterpreted them as TIMESTAMPs with >> instant >> > > > semantics. This would have led to a difference of several hours, >> > > > corresponding to the time zone offset. >> > > > >> > > > In the light of your e-mail, this misinterpretation of timestamps >> may >> > > > already be happening, since if Arrow annotates local timestamps with >> > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets >> them >> > > as >> > > > timestamps with instant semantics, leading to a difference of >> several >> > > hours. >> > > > >> > > > Based on this, I think it would make sense from Arrow's point of >> view to >> > > > annotate both semantics with the old types, since that is its >> historical >> > > > behaviour and keeping it up is needed for
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Wes, Yes, I agree that we should do that, but then we have a problem of what to do in the other direction, i.e. when we use the new logical types API to read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC normalized flag? Tim has started a discussion about that, suggesting to use three states that I just answered. Br, Zoltan On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney wrote: > Thanks for the comments. > > So in summary I think that we need to set the TIMESTAMP_* converted > types to maintain forward compatibility and stay consistent with what > we were doing in the C++ library prior to the introduction of the > LogicalType metadata. > > On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi > wrote: > > > > Hi Wes, > > > > Both of the semantics are deterministic in one aspect and indeterministic > > in another. Timestamps of instant semantics will always refer to the same > > instant, but their user-facing representation (how they get displayed) > > depends on the user's time zone. Timestamps of local semantics always > have > > the same user-facing representation but the instant they refer to is > > undefined (or ambiguous, depending on your point of view). > > > > My understanding is that Spark uses instant semantics, i.e., timestamps > are > > stored normalized to UTC and are displayed adjusted to the user's local > > time zone. > > > > Br, > > > > Zoltan > > > > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney wrote: > > > > > Thanks Zoltan. > > > > > > This is definitely a tricky issue. > > > > > > Spark's application of localtime semantics to timestamp data has been > > > a source of issues for many people. Personally I don't find that > > > behavior to be particularly helpful since depending on the session > > > time zone, you will get different results on data not marked as > > > UTC-normalized. 
> > > > > > In pandas, in contrast, we apply UTC semantics to > > > naive/not-explicitly-normalized data so at least the code produces > > > deterministic results on all environments. That seems strictly better > > > to me -- if you want a localized interpretation of naive data, you > > > must opt into this rather than having it automatically selected based > > > on your locale. The instances of people shooting their toes off due to > > > time zones are practically non-existent, whereas I'm hearing about > > > Spark gotchas all the time. > > > > > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > > > > wrote: > > > > > > > > Hi Wes, > > > > > > > > The rules for TIMESTAMP forward-compatibility were created based on > the > > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been > used > > > > in the instant aka. UTC-normalized semantics so far. This assumption > was > > > > supported by two sources: > > > > > > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and > > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since > the > > > Unix > > > > epoch, an instant specified in UTC, from which it follows that they > have > > > > instant semantics (because timestamps of local semantics do not > > > correspond > > > > to a single instant). > > > > > > > > 2. Anecdotal knowledge: We were not aware of any software component > that > > > > used these types differently from the specification. > > > > > > > > Based on your e-mail, we were wrong on #2. > > > > > > > > From this false premise it followed that TIMESTAMPs with local > semantics > > > > were a new type and did not need to be annotated with the old types > to > > > > maintain compatibility. 
In fact, annotating them with the old types > was > > > > considered to be harmful, since it would have misled older readers > into > > > > thinking that they can read TIMESTAMPs with local semantics, when in > > > > reality they would have misinterpreted them as TIMESTAMPs with > instant > > > > semantics. This would have led to a difference of several hours, > > > > corresponding to the time zone offset. > > > > > > > > In the light of your e-mail, this misinterpretation of timestamps may > > > > already be happening, since if Arrow annotates local timestamps with > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets > them > > > as > > > > timestamps with instant semantics, leading to a difference of several > > > hours. > > > > > > > > Based on this, I think it would make sense from Arrow's point of > view to > > > > annotate both semantics with the old types, since that is its > historical > > > > behaviour and keeping it up is needed for maintaining compatibility. > I'm > > > not > > > > so sure about the Java library though, since as far as I know, these > > > types > > > > were never used in the local sense there (although I may be wrong > again). > > > > Were we to decide that Arrow and parquet-mr should behave > differently in > > > > this aspect though, it may be tricky to convey this distinction in > the > > > > specification. I would be interested in hearing your and
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Thanks for the comments. So in summary I think that we need to set the TIMESTAMP_* converted types to maintain forward compatibility and stay consistent with what we were doing in the C++ library prior to the introduction of the LogicalType metadata. On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi wrote: > > Hi Wes, > > Both of the semantics are deterministic in one aspect and indeterministic > in another. Timestamps of instant semantics will always refer to the same > instant, but their user-facing representation (how they get displayed) > depends on the user's time zone. Timestamps of local semantics always have > the same user-facing representation but the instant they refer to is > undefined (or ambiguous, depending on your point of view). > > My understanding is that Spark uses instant semantics, i.e., timestamps are > stored normalized to UTC and are displayed adjusted to the user's local > time zone. > > Br, > > Zoltan > > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney wrote: > > > Thanks Zoltan. > > > > This is definitely a tricky issue. > > > > Spark's application of localtime semantics to timestamp data has been > > a source of issues for many people. Personally I don't find that > > behavior to be particularly helpful since depending on the session > > time zone, you will get different results on data not marked as > > UTC-normalized. > > > > In pandas, in contrast, we apply UTC semantics to > > naive/not-explicitly-normalized data so at least the code produces > > deterministic results on all environments. That seems strictly better > > to me -- if you want a localized interpretation of naive data, you > > must opt into this rather than having it automatically selected based > > on your locale. The instances of people shooting their toes off due to > > time zones are practically non-existent, whereas I'm hearing about > > Spark gotchas all the time. 
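The forward-compatibility rule Wes summarizes here, writing the legacy TIMESTAMP_* annotation alongside the new logical type, can be sketched as below. The function name and its `force` flag are hypothetical: the flag loosely models a writer-side override of the spec's UTC-only rule (akin to the option the thread describes for parquet-cpp), and none of this is the real library API:

```python
from typing import Optional

def converted_type_for(unit: str, is_adjusted_to_utc: bool,
                       force: bool = False) -> Optional[str]:
    """Pick the legacy annotation to write next to a new Timestamp type.

    Per the spec's compatibility rules, only UTC-normalized timestamps get a
    TIMESTAMP_* converted type; `force` models a writer opting to annotate
    local-semantics timestamps anyway, for older readers' benefit.
    """
    if is_adjusted_to_utc or force:
        return f"TIMESTAMP_{unit}"
    return None  # old readers would misread local semantics as instants

print(converted_type_for("MICROS", True))         # TIMESTAMP_MICROS
print(converted_type_for("MICROS", False))        # None
print(converted_type_for("MILLIS", False, True))  # TIMESTAMP_MILLIS
```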
> > > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > > wrote: > > > > > > Hi Wes, > > > > > > The rules for TIMESTAMP forward-compatibility were created based on the > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been used > > > in the instant aka. UTC-normalized semantics so far. This assumption was > > > supported by two sources: > > > > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since the > > Unix > > > epoch, an instant specified in UTC, from which it follows that they have > > > instant semantics (because timestamps of local semantics do not > > correspond > > > to a single instant). > > > > > > 2. Anecdotal knowledge: We were not aware of any software component that > > > used these types differently from the specification. > > > > > > Based on your e-mail, we were wrong on #2. > > > > > > From this false premise it followed that TIMESTAMPs with local semantics > > > were a new type and did not need to be annotated with the old types to > > > maintain compatibility. In fact, annotating them with the old types were > > > considered to be harmful, since it would have mislead older readers into > > > thinking that they can read TIMESTAMPs with local semantics, when in > > > reality they would have misinterpreted them as TIMESTAMPs with instant > > > semantics. This would have lead to a difference of several hours, > > > corresponding to the time zone offset. > > > > > > In the light of your e-mail, this misinterpretation of timestamps may > > > already be happening, since if Arrow annotates local timestamps with > > > TIMESTAMP_MILLIS or TIMESTMAP_MICROS, Spark probably misinterprets them > > as > > > timestamps with instant semantics, leading to a difference of several > > hours. 
> > > > > > Based on this, I think it would make sense from Arrow's point of view to > > > annotate both semantics with the old types, since that is its historical > > > behaviour and keeping it up is needed for maintaining compatibilty. I'm > > not > > > so sure about the Java library though, since as far as I know, these > > types > > > were never used in the local sense there (although I may be wrong again). > > > Were we to decide that Arrow and parquet-mr should behave differently in > > > this aspect though, it may be tricky to convey this distinction in the > > > specification. I would be interested in hearing your and other > > developers' > > > opinions on this. > > > > > > Thanks, > > > > > > Zoltan > > > > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney wrote: > > > > > > > hi folks, > > > > > > > > We have just recently implemented the new LogicalType unions in the > > > > Parquet C++ library and we have run into a forward compatibility > > > > problem with reader versions prior to this implementation. > > > > > > > > To recap the issue, prior to the introduction of LogicalType, the > > > > Parquet format had no explicit notion of time zones or UTC > > > > normalization. The new TimestampType provides a flag to indicate > > > >
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Tim,

In my opinion the specification of the older timestamp types only allowed UTC-normalized storage, since these types were defined as the number of milli/microseconds elapsed since the Unix epoch. This clearly defines the meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. 1970-01-01 00:00:00 UTC. It does not say anything about how this value must be displayed: it may be displayed as "1970-01-01 00:00:00 UTC", but typically it is displayed adjusted to the user's local time zone, for example "1970-01-01 01:00:00" for a user in Paris. I don't think this definition allows interpreting the numeric value 0 as "1970-01-01 00:00:00" in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC, which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or 10^6 for _MICROS) instead.

I realize that compatibility with real-life usage patterns is important regardless of whether they comply with the specification or not, but I can't think of any solution that would be useful in practice. The suggestion to turn the boolean into an enum would certainly allow Parquet to have timestamps with unknown semantics, but I don't know what value that would bring to applications and how they would use it. I'm also afraid that the undefined semantics would get misused/overused by developers who are not sure about the difference between the two semantics, and we would end up with a lot of meaningless timestamps.

Even with the problems I listed, your suggestion may still be better than the current solution, but before making a community decision I would like to continue this discussion focusing on three questions:

- What are the implications of this change? How will unknown semantics be used in practice? Does it bring value?
- Can we do better?
- Can we even change the boolean to an enum? It has been specified like that and released a long time ago.
Although I am not aware of any software component that would have already implemented it, I was also unaware of software components using TIMESTAMP_MILLIS and _MICROS for local semantics. One alternative that comes to my mind is to default to the more common UTC-normalized semantics but allow overriding it in the reader schema.

Thanks,

Zoltan

On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau wrote:
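Zoltan's epoch arithmetic above can be checked directly. The following is an illustrative Python sketch only, not Parquet library code; "Paris" is modeled as a fixed +01:00 offset so the example does not depend on tzdata.

```python
from datetime import datetime, timedelta, timezone

utc = timezone.utc
paris = timezone(timedelta(hours=1))  # fixed +01:00 offset standing in for Paris

# Under instant semantics, the stored numeric value 0 means the Unix epoch:
instant = datetime.fromtimestamp(0, tz=utc)
print(instant.strftime("%Y-%m-%d %H:%M:%S"))                    # 1970-01-01 00:00:00
print(instant.astimezone(paris).strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 01:00:00

# The *local* wall-clock time "1970-01-01 00:00:00" in Paris is a different
# instant, whose stored value would be -3600 seconds (times 10^3 or 10^6):
local_in_paris = datetime(1970, 1, 1, 0, 0, 0, tzinfo=paris)
print(int(local_in_paris.timestamp()))                          # -3600
```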
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Wes,

Both of the semantics are deterministic in one aspect and nondeterministic in another. Timestamps of instant semantics will always refer to the same instant, but their user-facing representation (how they get displayed) depends on the user's time zone. Timestamps of local semantics always have the same user-facing representation, but the instant they refer to is undefined (or ambiguous, depending on your point of view).

My understanding is that Spark uses instant semantics, i.e., timestamps are stored normalized to UTC and are displayed adjusted to the user's local time zone.

Br,

Zoltan

On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney wrote:
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
I'm not a long-time Parquet user, but I assisted in the expansion of the parquet-cpp library's LogicalType facility.

My impression is that the original TIMESTAMP converted types were silent on whether the annotated value was UTC-adjusted, and that (often arcane) out-of-band information had to be relied on by readers to decide the UTC adjustment status for timestamp columns. That perceived shortcoming seemed to me a primary motivator for adding the isAdjustedToUTC boolean parameter to the corresponding new Timestamp LogicalType. If that impression is accurate, then when reading TIMESTAMP columns written by legacy (converted-type-only) writers, it seems inappropriate for LogicalType-aware readers to unconditionally assign *either* "false" or "true" (as currently required) to a boolean UTC-adjusted parameter, as that requires the reader to infer a property that wasn't implied by the writer.

One possible approach to untangling this might be to amend the parquet.thrift specification to change the isAdjustedToUTC boolean property to an enum or union type (some enumerated list) named (for example) UTCAdjustment, with three possible values: Unknown, UTCAdjusted, NotUTCAdjusted (I'm not married to the names). Extant files with TIMESTAMP converted types only would map, for forward compatibility, to Timestamp LogicalTypes with UTCAdjustment := Unknown. New files with user-supplied Timestamp LogicalTypes would always record the converted type as TIMESTAMP for backward compatibility regardless of the value of the new UTCAdjustment parameter (this would be lossy on a round trip through a legacy library, but that's unavoidable, and the legacy libraries would be no worse off than they are now). The specification would normatively state that new user-supplied Timestamp LogicalTypes SHOULD (or MUST?) use either UTCAdjusted or NotUTCAdjusted, discouraging the use of Unknown in new files.

Thanks, Tim
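The three-valued proposal above can be sketched as a small model. UTCAdjustment and adjustment_from_legacy are hypothetical names taken from the proposal, not part of any released Parquet library.

```python
from enum import Enum

class UTCAdjustment(Enum):
    # Hypothetical three-valued replacement for the isAdjustedToUTC boolean.
    UNKNOWN = "Unknown"
    UTC_ADJUSTED = "UTCAdjusted"
    NOT_UTC_ADJUSTED = "NotUTCAdjusted"

def adjustment_from_legacy(converted_type):
    # A bare TIMESTAMP_* converted type from a legacy writer says nothing
    # about UTC normalization, so under the proposal it maps to Unknown
    # instead of forcing a true/false guess on the reader.
    if converted_type in ("TIMESTAMP_MILLIS", "TIMESTAMP_MICROS"):
        return UTCAdjustment.UNKNOWN
    raise ValueError("not a legacy timestamp converted type: %s" % converted_type)

print(adjustment_from_legacy("TIMESTAMP_MICROS"))  # UTCAdjustment.UNKNOWN
```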
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Thanks Zoltan.

This is definitely a tricky issue.

Spark's application of local-time semantics to timestamp data has been a source of issues for many people. Personally I don't find that behavior to be particularly helpful, since depending on the session time zone you will get different results on data not marked as UTC-normalized.

In pandas, by contrast, we apply UTC semantics to naive/not-explicitly-normalized data, so at least the code produces deterministic results in all environments. That seems strictly better to me: if you want a localized interpretation of naive data, you must opt into this rather than having it automatically selected based on your locale. The instances of people shooting their toes off due to time zones are practically non-existent, whereas I'm hearing about Spark gotchas all the time.

On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi wrote:
Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
Hi Wes,

The rules for TIMESTAMP forward compatibility were created based on the assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been used with the instant, a.k.a. UTC-normalized, semantics so far. This assumption was supported by two sources:

1. The specification: parquet-format defined TIMESTAMP_MILLIS and TIMESTAMP_MICROS as the number of milli/microseconds elapsed since the Unix epoch, an instant specified in UTC, from which it follows that they have instant semantics (because timestamps of local semantics do not correspond to a single instant).

2. Anecdotal knowledge: We were not aware of any software component that used these types differently from the specification.

Based on your e-mail, we were wrong on #2.

From this false premise it followed that TIMESTAMPs with local semantics were a new type and did not need to be annotated with the old types to maintain compatibility. In fact, annotating them with the old types was considered to be harmful, since it would have misled older readers into thinking that they can read TIMESTAMPs with local semantics, when in reality they would have misinterpreted them as TIMESTAMPs with instant semantics. This would have led to a difference of several hours, corresponding to the time zone offset.

In the light of your e-mail, this misinterpretation of timestamps may already be happening: if Arrow annotates local timestamps with TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets them as timestamps with instant semantics, leading to a difference of several hours.

Based on this, I think it would make sense from Arrow's point of view to annotate both semantics with the old types, since that is its historical behaviour and keeping it up is needed for maintaining compatibility. I'm not so sure about the Java library though, since as far as I know these types were never used in the local sense there (although I may be wrong again). Were we to decide that Arrow and parquet-mr should behave differently in this aspect, though, it may be tricky to convey this distinction in the specification. I would be interested in hearing your and other developers' opinions on this.

Thanks,

Zoltan

On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney wrote:
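The backward-compatibility assumption described above (a legacy TIMESTAMP_* annotation implies instant semantics) can be modeled in a few lines. This is an illustrative sketch, not library code; the parquet-cpp API mentioned elsewhere in the thread additionally exposes a "derived from converted type" flag so callers can override the inference.

```python
from dataclasses import dataclass

@dataclass
class Timestamp:
    # Model of the new logical type's parameters.
    is_adjusted_to_utc: bool
    unit: str

def logical_from_converted(converted_type):
    # Per the backward-compat rule discussed in the thread: legacy
    # TIMESTAMP_* annotations were assumed UTC-normalized, so a reader
    # maps them to isAdjustedToUTC = true.
    unit = {"TIMESTAMP_MILLIS": "MILLIS", "TIMESTAMP_MICROS": "MICROS"}[converted_type]
    return Timestamp(is_adjusted_to_utc=True, unit=unit)

print(logical_from_converted("TIMESTAMP_MICROS"))
```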
Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType
hi folks,

We have just recently implemented the new LogicalType unions in the Parquet C++ library and we have run into a forward compatibility problem with reader versions prior to this implementation.

To recap the issue: prior to the introduction of LogicalType, the Parquet format had no explicit notion of time zones or UTC normalization. The new TimestampType provides a flag to indicate UTC normalization:

struct TimestampType {
  1: required bool isAdjustedToUTC
  2: required TimeUnit unit
}

When using this new type, the ConvertedType field must also be set for forward compatibility (so that old readers can still understand the data), but parquet.thrift says

// use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC = true, unit = MICROS)
// use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC = true, unit = MILLIS)
8: TimestampType TIMESTAMP

In Apache Arrow, we have 2 varieties of timestamps:

* Timestamp without time zone (no UTC normalization indicated)
* Timestamp with time zone (values UTC-normalized)

Prior to the introduction of LogicalType, we would set either TIMESTAMP_MILLIS or TIMESTAMP_MICROS regardless of UTC normalization. So when reading the data back, any notion of having had a time zone is lost (it could be stored in schema metadata if desired).

I believe that setting the TIMESTAMP_* ConvertedType _only_ when isAdjustedToUTC is true creates a forward compatibility break in this regard. This was reported to us shortly after releasing Apache Arrow 0.14.0:

https://issues.apache.org/jira/browse/ARROW-5878

We are discussing setting the ConvertedType unconditionally in

https://github.com/apache/arrow/pull/4825

This might need to be a setting that is toggled when data is coming from Arrow, but I wonder if the text in parquet.thrift is the intended forward compatibility interpretation, and if not whether we should amend it.

Thanks,
Wes
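The forward-compatibility break described above can be modeled by contrasting the two write-side rules. This is an illustrative sketch, not parquet-cpp code.

```python
def converted_type_per_spec(is_adjusted_to_utc, unit):
    # parquet.thrift comment quoted above: emit TIMESTAMP_* only when
    # isAdjustedToUTC is true. A non-normalized timestamp therefore gets
    # no converted type at all, and pre-LogicalType readers see a plain
    # INT64 column -- the forward-compatibility break.
    if is_adjusted_to_utc:
        return {"MILLIS": "TIMESTAMP_MILLIS", "MICROS": "TIMESTAMP_MICROS"}[unit]
    return None

def converted_type_legacy_arrow(unit):
    # Arrow's pre-LogicalType behaviour: annotate regardless of normalization.
    return {"MILLIS": "TIMESTAMP_MILLIS", "MICROS": "TIMESTAMP_MICROS"}[unit]

print(converted_type_per_spec(False, "MICROS"))   # None
print(converted_type_legacy_arrow("MICROS"))      # TIMESTAMP_MICROS
```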