Hi Zoltan, Thank you for your engagement on this issue, it's much appreciated.
The upcoming release of the parquet-cpp library takes essentially the approach you outline. When a Timestamp logical type is created a from a legacy converted TIMESTAMP_* value (your third case), the Timestamp's UTC-normalized property is set to "true" (per the backward compatibility requirements in the spec) but an additional property indicating that the LogicalType was derived from a ConvertedType is also set to "true" and exposed via an accessor. So a library user can choose to accept the default UTC normalization inference or override it by checking the new flag. Similarly, for forward compatibility, the default behavior is to produce a TIMESTAMP_* converted type only if the Timestamp logical type is UTC normalized (again conforming to the spec), but the Timestamp LogicalType API allows the library user to force a TIMESTAMP converted type even for non-UTC normalized Timestamps. Given these API options, I tend to agree with you that the need for file format changes is substantially reduced or eliminated. Thanks, Tim On Thu, Jul 11, 2019 at 6:41 AM Zoltan Ivanfi <[email protected]> wrote: > Hi Tim, > > I was just told that arrow 0.14.0 is already out and uses the new > logical types. This means that your original suggestion of changing > the boolean to an enum is not really a viable option. Regarding your > second suggestion, an additional field may be added to parquet-format, > but I'm still concerned that it would be confusing and that it would > provide little value in practice. However, thinking more about the > problem I realized that we do not have to add anything to > parquet-format to record the uncertainty of timestamp semantics, since > we already have a way for doing so by using the legacy timestamp > types. It is enough to expose the possiblity of having undefined > semantics in the API. This can be deduced from the logical types > stored in the file in the following way: > > Logical type in file => Logical type in API > ----------------------------------------------------------- > new TS type with semantics A => TS with semantics A > new TS type with semantics B => TS with semantics B > legacy TS type => TS with unknown semantics > > For example, TIMESTAMP_MICROS would map to unknown semantics in the > API while TIMESTAMP(isAdjustedToUTC=true, unit=MICROS) would map to > instant semantics. Please let me know your thoughts on this > suggestion. > > Thanks, > > Zoltan > > On Wed, Jul 10, 2019 at 6:43 PM TP Boudreau <[email protected]> wrote: > > > > Sorry for the quick self-reply, but after brief reflection I think two > > changes to my alternative proposal are required: > > > > 1. The proposed new field should be a parameter to the TimestampType, > not > > FileMetaData -- file level adds unnecessary complication / opportunities > > for mischief. > > 2. Although reported vs. inferred is the logical distinction, > practically > > this change is about whether or not the TimestampType was built from a > > TIMESTAMP converted type, so the name should reflect that. > > > > After these amendments, this option boils down to: add a new boolean > > parameter to the TimestampType named (something like) fromConvertedType. > > > > --TPB > > > > On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau <[email protected]> > wrote: > > > > > Hi Zoltan, > > > > > > Thank you for the helpful clarification of the community's > understanding > > > of the TIMESTAMP annotation. > > > > > > The core of the problem (IMHO) is that there no way to distinguish in > the > > > new LogicalType TimestampType between the case where UTC-normalization > has > > > been directly reported (via a user supplied TimestampType boolean > > > parameter) or merely inferred (from a naked TIMESTAMP converted > type). So > > > perhaps another alternative might be to retain the isAdjustedToUTC > boolean > > > as is and add another field that indicates whether the adjusted flag > was > > > REPORTED or INFERRED (could be boolean or short variant or some type). > > > This would allow interested readers to differentiate between old and > new > > > timestamps, while allowing other readers to enjoy the default you > believe > > > is warranted. It seems most straightforward that it would be an > additional > > > parameter on the TimestampType, but I supposed it could reside in the > > > FileMetaData struct (on the assumption that the schema elements, having > > > been written by the same writer, all uniformly use converted type or > > > LogicalType). > > > > > > --Tim > > > > > > > > > On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi <[email protected] > > > > > wrote: > > > > > >> Hi Tim, > > >> > > >> In my opinion the specification of the older timestamp types only > allowed > > >> UTC-normalized storage, since these types were defined as the number > of > > >> milli/microseconds elapsed since the Unix epoch. This clearly defines > the > > >> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e. > > >> 1970-01-01 00:00:00 UTC. It does not say anything about how this value > > >> must > > >> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", > but > > >> typically it is displayed adjusted to the user's local timezone, for > > >> example "1970-01-01 01:00:00" for a user in Paris. I don't think this > > >> definition allows interpreting the numeric value 0 as "1970-01-01 > > >> 00:00:00" > > >> in Paris, since the latter would correspond to 1969-12-31 23:00:00 > UTC, > > >> which must be stored as the numeric value -3600 (times 10^3 for > _MILLIS or > > >> 10^6 for _MICROS) instead. > > >> > > >> I realize that compatibility with real-life usage patterns is > important > > >> regardless of whether they comply with the specification or not, but I > > >> can't think of any solution that would be useful in practice. The > > >> suggestion to turn the boolean into an enum would certainly allow > Parquet > > >> to have timestamps with unknown semantics, but I don't know what value > > >> that > > >> would bring to applications and how they would use it. I'm also afraid > > >> that > > >> the undefined semantics would get misused/overused by developers who > are > > >> not sure about the difference between the two semantics and we would > end > > >> up > > >> with a lot of meaningless timestamps. > > >> > > >> Even with the problems I listed your suggestion may still be better > than > > >> the current solution, but before making a community decision I would > like > > >> to continue this discussion focusing on three questions: > > >> > > >> - What are the implications of this change? > > >> - How will unknown semantics be used in practice? > > >> - Does it bring value? > > >> - Can we do better? > > >> - Can we even change the boolean to an enum? It has been specified > like > > >> that and released a long time ago. Although I am not aware of any > > >> software > > >> component that would have already implemented it, I was also > unaware of > > >> software components using TIMESTAMP_MILLIS and _MICROS for local > > >> semantics. > > >> > > >> One alternative that comes to my mind is to default to the more common > > >> UTC-normalized semantics but allow overriding it in the reader schema. > > >> > > >> Thanks, > > >> > > >> Zoltan > > >> > > >> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau <[email protected]> > wrote: > > >> > > >> > I'm not a long-time Parquet user, but I assisted in the expansion > of the > > >> > parquet-cpp library's LogicalType facility. > > >> > > > >> > My impression is that the original TIMESTAMP converted types were > > >> silent on > > >> > whether the annotated value was UTC adjusted and that (often arcane) > > >> > out-of-band information had to be relied on by readers to decide > the UTC > > >> > adjustment status for timestamp columns. It seemed to me that that > > >> > perceived shortcoming was a primary motivator for adding the > > >> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp > > >> > LogicalType. If that impression is accurate, then when reading > > >> TIMESTAMP > > >> > columns written by legacy (converted type only) writers, it seems > > >> > inappropriate for LogicalType aware readers to unconditionally > assign > > >> > *either* "false" or "true" (as currently required) to a boolean UTC > > >> > adjusted parameter, as that requires the reader to infer a property > that > > >> > wasn't implied by the writer. > > >> > > > >> > One possible approach to untangling this might be to amend the > > >> > parquet.thrift specification to change the isAdjustedToUTC boolean > > >> property > > >> > to an enum or union type (some enumerated list) named (for example) > > >> > UTCAdjustment with three possible values: Unknown, UTCAdjusted, > > >> > NotUTCAdjusted (I'm not married to the names). Extant files with > > >> TIMESTAMP > > >> > converted types only would map for forward compatibility to > Timestamp > > >> > LogicalTypes with UTCAdjustment:=Unknown . New files with user > supplied > > >> > Timestamp LogicalTypes would always record the converted type as > > >> TIMESTAMP > > >> > for backward compatibility regardless of the value of the new > > >> UTCAdjustment > > >> > parameter (this would be lossy on a round-trip through a legacy > library, > > >> > but that's unavoidable -- and the legacy libraries would be no > worse off > > >> > than they are now). The specification would normatively state that > new > > >> > user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either > > >> > UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in > new > > >> > files). > > >> > > > >> > Thanks, Tim > > >> > > > >> > > > >
