Hi Tim, I was just told that arrow 0.14.0 is already out and uses the new logical types. This means that your original suggestion of changing the boolean to an enum is not really a viable option. Regarding your second suggestion, an additional field may be added to parquet-format, but I'm still concerned that it would be confusing and that it would provide little value in practice. However, thinking more about the problem I realized that we do not have to add anything to parquet-format to record the uncertainty of timestamp semantics, since we already have a way for doing so by using the legacy timestamp types. It is enough to expose the possiblity of having undefined semantics in the API. This can be deduced from the logical types stored in the file in the following way:
Logical type in file => Logical type in API ----------------------------------------------------------- new TS type with semantics A => TS with semantics A new TS type with semantics B => TS with semantics B legacy TS type => TS with unknown semantics For example, TIMESTAMP_MICROS would map to unknown semantics in the API while TIMESTAMP(isAdjustedToUTC=true, unit=MICROS) would map to instant semantics. Please let me know your thoughts on this suggestion. Thanks, Zoltan On Wed, Jul 10, 2019 at 6:43 PM TP Boudreau <[email protected]> wrote: > > Sorry for the quick self-reply, but after brief reflection I think two > changes to my alternative proposal are required: > > 1. The proposed new field should be a parameter to the TimestampType, not > FileMetaData -- file level adds unnecessary complication / opportunities > for mischief. > 2. Although reported vs. inferred is the logical distinction, practically > this change is about whether or not the TimestampType was built from a > TIMESTAMP converted type, so the name should reflect that. > > After these amendments, this option boils down to: add a new boolean > parameter to the TimestampType named (something like) fromConvertedType. > > --TPB > > On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau <[email protected]> wrote: > > > Hi Zoltan, > > > > Thank you for the helpful clarification of the community's understanding > > of the TIMESTAMP annotation. > > > > The core of the problem (IMHO) is that there no way to distinguish in the > > new LogicalType TimestampType between the case where UTC-normalization has > > been directly reported (via a user supplied TimestampType boolean > > parameter) or merely inferred (from a naked TIMESTAMP converted type). So > > perhaps another alternative might be to retain the isAdjustedToUTC boolean > > as is and add another field that indicates whether the adjusted flag was > > REPORTED or INFERRED (could be boolean or short variant or some type). > > This would allow interested readers to differentiate between old and new > > timestamps, while allowing other readers to enjoy the default you believe > > is warranted. It seems most straightforward that it would be an additional > > parameter on the TimestampType, but I supposed it could reside in the > > FileMetaData struct (on the assumption that the schema elements, having > > been written by the same writer, all uniformly use converted type or > > LogicalType). > > > > --Tim > > > > > > On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi <[email protected]> > > wrote: > > > >> Hi Tim, > >> > >> In my opinion the specification of the older timestamp types only allowed > >> UTC-normalized storage, since these types were defined as the number of > >> milli/microseconds elapsed since the Unix epoch. This clearly defines the > >> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. > >> 1970-01-01 00:00:00 UTC. It does not say anything about how this value > >> must > >> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but > >> typically it is displayed adjusted to the user's local timezone, for > >> example "1970-01-01 01:00:00" for a user in Paris. I don't think this > >> definition allows interpreting the numeric value 0 as "1970-01-01 > >> 00:00:00" > >> in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC, > >> which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or > >> 10^6 for _MICROS) instead. > >> > >> I realize that compatibility with real-life usage patterns is important > >> regardless of whether they comply with the specification or not, but I > >> can't think of any solution that would be useful in practice. The > >> suggestion to turn the boolean into an enum would certainly allow Parquet > >> to have timestamps with unknown semantics, but I don't know what value > >> that > >> would bring to applications and how they would use it. I'm also afraid > >> that > >> the undefined semantics would get misused/overused by developers who are > >> not sure about the difference between the two semantics and we would end > >> up > >> with a lot of meaningless timestamps. > >> > >> Even with the problems I listed your suggestion may still be better than > >> the current solution, but before making a community decision I would like > >> to continue this discussion focusing on three questions: > >> > >> - What are the implications of this change? > >> - How will unknown semantics be used in practice? > >> - Does it bring value? > >> - Can we do better? > >> - Can we even change the boolean to an enum? It has been specified like > >> that and released a long time ago. Although I am not aware of any > >> software > >> component that would have already implemented it, I was also unaware of > >> software components using TIMESTAMP_MILLIS and _MICROS for local > >> semantics. > >> > >> One alternative that comes to my mind is to default to the more common > >> UTC-normalized semantics but allow overriding it in the reader schema. > >> > >> Thanks, > >> > >> Zoltan > >> > >> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau <[email protected]> wrote: > >> > >> > I'm not a long-time Parquet user, but I assisted in the expansion of the > >> > parquet-cpp library's LogicalType facility. > >> > > >> > My impression is that the original TIMESTAMP converted types were > >> silent on > >> > whether the annotated value was UTC adjusted and that (often arcane) > >> > out-of-band information had to be relied on by readers to decide the UTC > >> > adjustment status for timestamp columns. It seemed to me that that > >> > perceived shortcoming was a primary motivator for adding the > >> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp > >> > LogicalType. If that impression is accurate, then when reading > >> TIMESTAMP > >> > columns written by legacy (converted type only) writers, it seems > >> > inappropriate for LogicalType aware readers to unconditionally assign > >> > *either* "false" or "true" (as currently required) to a boolean UTC > >> > adjusted parameter, as that requires the reader to infer a property that > >> > wasn't implied by the writer. > >> > > >> > One possible approach to untangling this might be to amend the > >> > parquet.thrift specification to change the isAdjustedToUTC boolean > >> property > >> > to an enum or union type (some enumerated list) named (for example) > >> > UTCAdjustment with three possible values: Unknown, UTCAdjusted, > >> > NotUTCAdjusted (I'm not married to the names). Extant files with > >> TIMESTAMP > >> > converted types only would map for forward compatibility to Timestamp > >> > LogicalTypes with UTCAdjustment:=Unknown . New files with user supplied > >> > Timestamp LogicalTypes would always record the converted type as > >> TIMESTAMP > >> > for backward compatibility regardless of the value of the new > >> UTCAdjustment > >> > parameter (this would be lossy on a round-trip through a legacy library, > >> > but that's unavoidable -- and the legacy libraries would be no worse off > >> > than they are now). The specification would normatively state that new > >> > user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either > >> > UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in new > >> > files). > >> > > >> > Thanks, Tim > >> > > >> > >
