Hi Tim,

In my opinion the specification of the older timestamp types only allowed
UTC-normalized storage, since these types were defined as the number of
milli/microseconds elapsed since the Unix epoch. This clearly defines the
meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e.
1970-01-01 00:00:00 UTC. It does not say anything about how this value must
be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but
typically it is displayed adjusted to the user's local timezone, for
example "1970-01-01 01:00:00" for a user in Paris. I don't think this
definition allows interpreting the numeric value 0 as "1970-01-01 00:00:00"
in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC,
which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or
10^6 for _MICROS) instead.

I realize that compatibility with real-life usage patterns is important
regardless of whether they comply with the specification or not, but I
can't think of any solution that would be useful in practice. The
suggestion to turn the boolean into an enum would certainly allow Parquet
to have timestamps with unknown semantics, but I don't know what value that
would bring to applications and how they would use it. I'm also afraid that
the undefined semantics would get misused/overused by developers who are
not sure about the difference between the two semantics and we would end up
with a lot of meaningless timestamps.

Even with the problems I listed your suggestion may still be better than
the current solution, but before making a community decision I would like
to continue this discussion focusing on three questions:

   - What are the implications of this change?
   - How will unknown semantics be used in practice?
   - Does it bring value?
   - Can we do better?
   - Can we even change the boolean to an enum? It has been specified like
   that and released a long time ago. Although I am not aware of any software
   component that would have already implemented it, I was also unaware of
   software components using TIMESTAMP_MILLIS and _MICROS for local semantics.

One alternative that comes to my mind is to default to the more common
UTC-normalized semantics but allow overriding it in the reader schema.

Thanks,

Zoltan

On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau <[email protected]> wrote:

> I'm not a long-time Parquet user, but I assisted in the expansion of the
> parquet-cpp library's LogicalType facility.
>
> My impression is that the original TIMESTAMP converted types were silent on
> whether the annotated value was UTC adjusted and that (often arcane)
> out-of-band information had to be relied on by readers to decide the UTC
> adjustment status for timestamp columns.  It seemed to me that that
> perceived shortcoming was a primary motivator for adding the
> isAdjustedToUTC boolean parameter to the corresponding new Timestamp
> LogicalType.  If that impression is accurate, then when reading TIMESTAMP
> columns written by legacy (converted type only) writers, it seems
> inappropriate for LogicalType aware readers to unconditionally assign
> *either* "false" or "true" (as currently required) to a boolean UTC
> adjusted parameter, as that requires the reader to infer a property that
> wasn't implied by the writer.
>
> One possible approach to untangling this might be to amend the
> parquet.thrift specification to change the isAdjustedToUTC boolean property
> to an enum or union type (some enumerated list) named (for example)
> UTCAdjustment with three possible values: Unknown, UTCAdjusted,
> NotUTCAdjusted (I'm not married to the names).  Extant files with TIMESTAMP
> converted types only would map for forward compatibility to Timestamp
> LogicalTypes with UTCAdjustment:=Unknown .  New files with user supplied
> Timestamp LogicalTypes would always record the converted type as TIMESTAMP
> for backward compatibility regardless of the value of the new UTCAdjustment
> parameter (this would be lossy on a round-trip through a legacy library,
> but that's unavoidable -- and the legacy libraries would be no worse off
> than they are now).  The specification would normatively state that new
> user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either
> UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in new
> files).
>
> Thanks, Tim
>

Reply via email to