Hi Tim,

In my opinion the specification of the older timestamp types only allowed UTC-normalized storage, since these types were defined as the number of milli/microseconds elapsed since the Unix epoch. This clearly defines the meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. 1970-01-01 00:00:00 UTC. It does not say anything about how this value must be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but typically it is displayed adjusted to the user's local timezone, for example "1970-01-01 01:00:00" for a user in Paris. I don't think this definition allows interpreting the numeric value 0 as "1970-01-01 00:00:00" in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC, which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or 10^6 for _MICROS) instead.
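
To make the arithmetic concrete, here is a minimal Python sketch. It is my own illustration, not anything taken from the Parquet specification or libraries: the helper to_epoch_seconds and the constants EPOCH and PARIS are hypothetical names, and a fixed UTC+1 offset is assumed to stand in for the Paris timezone on that date.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumption: a fixed UTC+1 offset stands in for Paris (CET) on this date.
PARIS = timezone(timedelta(hours=1), "CET")


def to_epoch_seconds(dt):
    """Seconds elapsed since the Unix epoch for a timezone-aware datetime."""
    return int((dt - EPOCH).total_seconds())


# The stored value 0 is the epoch itself; only its display depends on the zone.
stored = 0
instant = EPOCH + timedelta(seconds=stored)
utc_display = instant.strftime("%Y-%m-%d %H:%M:%S")
paris_display = instant.astimezone(PARIS).strftime("%Y-%m-%d %H:%M:%S")

# Midnight on the Paris wall clock is one hour before the epoch.
paris_midnight = datetime(1970, 1, 1, tzinfo=PARIS)
seconds = to_epoch_seconds(paris_midnight)
millis = seconds * 10**3   # scale used by TIMESTAMP_MILLIS
micros = seconds * 10**6   # scale used by TIMESTAMP_MICROS

print(utc_display, paris_display, seconds, millis, micros)
```

Under those assumptions the value 0 displays as 00:00:00 in UTC and as 01:00:00 in Paris, while midnight on the Paris wall clock comes out as -3600 seconds.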
I realize that compatibility with real-life usage patterns is important regardless of whether they comply with the specification or not, but I can't think of any solution that would be useful in practice. The suggestion to turn the boolean into an enum would certainly allow Parquet to have timestamps with unknown semantics, but I don't know what value that would bring to applications and how they would use it. I'm also afraid that the undefined semantics would get misused/overused by developers who are not sure about the difference between the two semantics and we would end up with a lot of meaningless timestamps.

Even with the problems I listed your suggestion may still be better than the current solution, but before making a community decision I would like to continue this discussion focusing on three questions:

- What are the implications of this change?
  - How will unknown semantics be used in practice?
  - Does it bring value?
- Can we do better?
- Can we even change the boolean to an enum? It has been specified like that and released a long time ago. Although I am not aware of any software component that would have already implemented it, I was also unaware of software components using TIMESTAMP_MILLIS and _MICROS for local semantics.

One alternative that comes to my mind is to default to the more common UTC-normalized semantics but allow overriding it in the reader schema.

Thanks,
Zoltan

On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau <[email protected]> wrote:

> I'm not a long-time Parquet user, but I assisted in the expansion of the
> parquet-cpp library's LogicalType facility.
>
> My impression is that the original TIMESTAMP converted types were silent on
> whether the annotated value was UTC adjusted and that (often arcane)
> out-of-band information had to be relied on by readers to decide the UTC
> adjustment status for timestamp columns. It seemed to me that that
> perceived shortcoming was a primary motivator for adding the
> isAdjustedToUTC boolean parameter to the corresponding new Timestamp
> LogicalType. If that impression is accurate, then when reading TIMESTAMP
> columns written by legacy (converted type only) writers, it seems
> inappropriate for LogicalType aware readers to unconditionally assign
> *either* "false" or "true" (as currently required) to a boolean UTC
> adjusted parameter, as that requires the reader to infer a property that
> wasn't implied by the writer.
>
> One possible approach to untangling this might be to amend the
> parquet.thrift specification to change the isAdjustedToUTC boolean property
> to an enum or union type (some enumerated list) named (for example)
> UTCAdjustment with three possible values: Unknown, UTCAdjusted,
> NotUTCAdjusted (I'm not married to the names). Extant files with TIMESTAMP
> converted types only would map for forward compatibility to Timestamp
> LogicalTypes with UTCAdjustment:=Unknown . New files with user supplied
> Timestamp LogicalTypes would always record the converted type as TIMESTAMP
> for backward compatibility regardless of the value of the new UTCAdjustment
> parameter (this would be lossy on a round-trip through a legacy library,
> but that's unavoidable -- and the legacy libraries would be no worse off
> than they are now). The specification would normatively state that new
> user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either
> UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in new
> files).
>
> Thanks, Tim
>
