I'm not a long-time Parquet user, but I assisted in the expansion of the
parquet-cpp library's LogicalType facility.

My impression is that the original TIMESTAMP converted types were silent on
whether the annotated value was UTC adjusted, so readers had to rely on
(often arcane) out-of-band information to decide the UTC adjustment status
of timestamp columns.  That perceived shortcoming seemed to me to be a
primary motivator for adding the isAdjustedToUTC boolean parameter to the
corresponding new Timestamp LogicalType.  If that impression is accurate,
then when reading TIMESTAMP columns written by legacy (converted type only)
writers, it seems inappropriate for LogicalType-aware readers to
unconditionally assign *either* "false" or "true" (as currently required)
to a boolean UTC adjustment parameter, because doing so forces the reader
to assert a property that the writer never stated.

One possible approach to untangling this might be to amend the
parquet.thrift specification so that the isAdjustedToUTC boolean property
becomes an enum or union type named (for example) UTCAdjustment, with three
possible values: Unknown, UTCAdjusted, NotUTCAdjusted (I'm not married to
the names).  For forward compatibility, extant files that carry only
TIMESTAMP converted types would map to Timestamp LogicalTypes with
UTCAdjustment := Unknown.  For backward compatibility, new files with
user-supplied Timestamp LogicalTypes would always record the converted type
as TIMESTAMP, regardless of the value of the new UTCAdjustment parameter.
That would be lossy on a round trip through a legacy library, but that's
unavoidable, and the legacy libraries would be no worse off than they are
now.  The specification would normatively state that new user-supplied
Timestamp LogicalTypes SHOULD (or MUST?) use either UTCAdjusted or
NotUTCAdjusted, discouraging the use of Unknown in new files.
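
To make the proposed mapping concrete, here is a rough sketch in Python
(chosen only for brevity; it is not parquet-cpp code).  Every name in it
(UTCAdjustment, TimestampLogicalType, from_converted_type,
to_converted_type) is hypothetical and exists in no Parquet library.  The
only thing taken from the format itself is the pair of legacy converted
types, TIMESTAMP_MILLIS and TIMESTAMP_MICROS.

```python
from dataclasses import dataclass
from enum import Enum


class UTCAdjustment(Enum):
    """Hypothetical three-valued replacement for the isAdjustedToUTC boolean."""
    UNKNOWN = 0
    UTC_ADJUSTED = 1
    NOT_UTC_ADJUSTED = 2


@dataclass(frozen=True)
class TimestampLogicalType:
    """Hypothetical in-memory form of the proposed Timestamp LogicalType."""
    unit: str  # "MILLIS" or "MICROS"
    utc_adjustment: UTCAdjustment


# Legacy converted types and the time unit each one implies.
_CONVERTED_TO_UNIT = {
    "TIMESTAMP_MILLIS": "MILLIS",
    "TIMESTAMP_MICROS": "MICROS",
}
_UNIT_TO_CONVERTED = {unit: ct for ct, unit in _CONVERTED_TO_UNIT.items()}


def from_converted_type(converted_type: str) -> TimestampLogicalType:
    """Forward compatibility: a legacy file never stated its UTC adjustment,
    so the reader records Unknown instead of guessing true or false."""
    unit = _CONVERTED_TO_UNIT[converted_type]
    return TimestampLogicalType(unit, UTCAdjustment.UNKNOWN)


def to_converted_type(logical_type: TimestampLogicalType) -> str:
    """Backward compatibility: the converted type depends only on the unit,
    so the UTC adjustment is dropped for legacy readers."""
    return _UNIT_TO_CONVERTED[logical_type.unit]


def is_recommended_for_new_files(logical_type: TimestampLogicalType) -> bool:
    """The proposed SHOULD/MUST rule: new files avoid Unknown."""
    return logical_type.utc_adjustment is not UTCAdjustment.UNKNOWN
```

A new type such as TimestampLogicalType("MILLIS",
UTCAdjustment.UTC_ADJUSTED) is written with converted type
TIMESTAMP_MILLIS; reading that converted type back through
from_converted_type yields Unknown, which is the lossy round trip
described above.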

Thanks, Tim
