Sorry for the quick self-reply, but after brief reflection I think two changes to my alternative proposal are required:
1. The proposed new field should be a parameter to the TimestampType, not FileMetaData -- file level adds unnecessary complication / opportunities for mischief. 2. Although reported vs. inferred is the logical distinction, practically this change is about whether or not the TimestampType was built from a TIMESTAMP converted type, so the name should reflect that. After these amendments, this option boils down to: add a new boolean parameter to the TimestampType named (something like) fromConvertedType. --TPB On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau <[email protected]> wrote: > Hi Zoltan, > > Thank you for the helpful clarification of the community's understanding > of the TIMESTAMP annotation. > > The core of the problem (IMHO) is that there no way to distinguish in the > new LogicalType TimestampType between the case where UTC-normalization has > been directly reported (via a user supplied TimestampType boolean > parameter) or merely inferred (from a naked TIMESTAMP converted type). So > perhaps another alternative might be to retain the isAdjustedToUTC boolean > as is and add another field that indicates whether the adjusted flag was > REPORTED or INFERRED (could be boolean or short variant or some type). > This would allow interested readers to differentiate between old and new > timestamps, while allowing other readers to enjoy the default you believe > is warranted. It seems most straightforward that it would be an additional > parameter on the TimestampType, but I supposed it could reside in the > FileMetaData struct (on the assumption that the schema elements, having > been written by the same writer, all uniformly use converted type or > LogicalType). > > --Tim > > > On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi <[email protected]> > wrote: > >> Hi Tim, >> >> In my opinion the specification of the older timestamp types only allowed >> UTC-normalized storage, since these types were defined as the number of >> milli/microseconds elapsed since the Unix epoch. This clearly defines the >> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. >> 1970-01-01 00:00:00 UTC. It does not say anything about how this value >> must >> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but >> typically it is displayed adjusted to the user's local timezone, for >> example "1970-01-01 01:00:00" for a user in Paris. I don't think this >> definition allows interpreting the numeric value 0 as "1970-01-01 >> 00:00:00" >> in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC, >> which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or >> 10^6 for _MICROS) instead. >> >> I realize that compatibility with real-life usage patterns is important >> regardless of whether they comply with the specification or not, but I >> can't think of any solution that would be useful in practice. The >> suggestion to turn the boolean into an enum would certainly allow Parquet >> to have timestamps with unknown semantics, but I don't know what value >> that >> would bring to applications and how they would use it. I'm also afraid >> that >> the undefined semantics would get misused/overused by developers who are >> not sure about the difference between the two semantics and we would end >> up >> with a lot of meaningless timestamps. >> >> Even with the problems I listed your suggestion may still be better than >> the current solution, but before making a community decision I would like >> to continue this discussion focusing on three questions: >> >> - What are the implications of this change? >> - How will unknown semantics be used in practice? >> - Does it bring value? >> - Can we do better? >> - Can we even change the boolean to an enum? It has been specified like >> that and released a long time ago. Although I am not aware of any >> software >> component that would have already implemented it, I was also unaware of >> software components using TIMESTAMP_MILLIS and _MICROS for local >> semantics. >> >> One alternative that comes to my mind is to default to the more common >> UTC-normalized semantics but allow overriding it in the reader schema. >> >> Thanks, >> >> Zoltan >> >> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau <[email protected]> wrote: >> >> > I'm not a long-time Parquet user, but I assisted in the expansion of the >> > parquet-cpp library's LogicalType facility. >> > >> > My impression is that the original TIMESTAMP converted types were >> silent on >> > whether the annotated value was UTC adjusted and that (often arcane) >> > out-of-band information had to be relied on by readers to decide the UTC >> > adjustment status for timestamp columns. It seemed to me that that >> > perceived shortcoming was a primary motivator for adding the >> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp >> > LogicalType. If that impression is accurate, then when reading >> TIMESTAMP >> > columns written by legacy (converted type only) writers, it seems >> > inappropriate for LogicalType aware readers to unconditionally assign >> > *either* "false" or "true" (as currently required) to a boolean UTC >> > adjusted parameter, as that requires the reader to infer a property that >> > wasn't implied by the writer. >> > >> > One possible approach to untangling this might be to amend the >> > parquet.thrift specification to change the isAdjustedToUTC boolean >> property >> > to an enum or union type (some enumerated list) named (for example) >> > UTCAdjustment with three possible values: Unknown, UTCAdjusted, >> > NotUTCAdjusted (I'm not married to the names). Extant files with >> TIMESTAMP >> > converted types only would map for forward compatibility to Timestamp >> > LogicalTypes with UTCAdjustment:=Unknown . New files with user supplied >> > Timestamp LogicalTypes would always record the converted type as >> TIMESTAMP >> > for backward compatibility regardless of the value of the new >> UTCAdjustment >> > parameter (this would be lossy on a round-trip through a legacy library, >> > but that's unavoidable -- and the legacy libraries would be no worse off >> > than they are now). The specification would normatively state that new >> > user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either >> > UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in new >> > files). >> > >> > Thanks, Tim >> > >> >
