Sorry for the quick self-reply, but after brief reflection I think two
changes to my alternative proposal are required:

1.  The proposed new field should be a parameter to the TimestampType, not
FileMetaData -- file level adds unnecessary complication / opportunities
for mischief.
2.  Although reported vs. inferred is the logical distinction, practically
this change is about whether or not the TimestampType was built from a
TIMESTAMP converted type, so the name should reflect that.

After these amendments, this option boils down to: add a new boolean
parameter to the TimestampType named (something like) fromConvertedType.

--TPB

On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau <[email protected]> wrote:

> Hi Zoltan,
>
> Thank you for the helpful clarification of the community's understanding
> of the TIMESTAMP annotation.
>
> The core of the problem (IMHO) is that there no way to distinguish in the
> new LogicalType TimestampType between the case where UTC-normalization has
> been directly reported (via a user supplied TimestampType boolean
> parameter) or merely inferred (from a naked TIMESTAMP converted type).  So
> perhaps another alternative might be to retain the isAdjustedToUTC boolean
> as is and add another field that indicates whether the adjusted flag was
> REPORTED or INFERRED (could be boolean or short variant or some type).
> This would allow interested readers to differentiate between old and new
> timestamps, while allowing other readers to enjoy the default you believe
> is warranted.  It seems most straightforward that it would be an additional
> parameter on the TimestampType, but I supposed it could reside in the
> FileMetaData struct (on the assumption that the schema elements, having
> been written by the same writer, all uniformly use converted type or
> LogicalType).
>
> --Tim
>
>
> On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi <[email protected]>
> wrote:
>
>> Hi Tim,
>>
>> In my opinion the specification of the older timestamp types only allowed
>> UTC-normalized storage, since these types were defined as the number of
>> milli/microseconds elapsed since the Unix epoch. This clearly defines the
>> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e.
>> 1970-01-01 00:00:00 UTC. It does not say anything about how this value
>> must
>> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but
>> typically it is displayed adjusted to the user's local timezone, for
>> example "1970-01-01 01:00:00" for a user in Paris. I don't think this
>> definition allows interpreting the numeric value 0 as "1970-01-01
>> 00:00:00"
>> in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC,
>> which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or
>> 10^6 for _MICROS) instead.
>>
>> I realize that compatibility with real-life usage patterns is important
>> regardless of whether they comply with the specification or not, but I
>> can't think of any solution that would be useful in practice. The
>> suggestion to turn the boolean into an enum would certainly allow Parquet
>> to have timestamps with unknown semantics, but I don't know what value
>> that
>> would bring to applications and how they would use it. I'm also afraid
>> that
>> the undefined semantics would get misused/overused by developers who are
>> not sure about the difference between the two semantics and we would end
>> up
>> with a lot of meaningless timestamps.
>>
>> Even with the problems I listed your suggestion may still be better than
>> the current solution, but before making a community decision I would like
>> to continue this discussion focusing on three questions:
>>
>>    - What are the implications of this change?
>>    - How will unknown semantics be used in practice?
>>    - Does it bring value?
>>    - Can we do better?
>>    - Can we even change the boolean to an enum? It has been specified like
>>    that and released a long time ago. Although I am not aware of any
>> software
>>    component that would have already implemented it, I was also unaware of
>>    software components using TIMESTAMP_MILLIS and _MICROS for local
>> semantics.
>>
>> One alternative that comes to my mind is to default to the more common
>> UTC-normalized semantics but allow overriding it in the reader schema.
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau <[email protected]> wrote:
>>
>> > I'm not a long-time Parquet user, but I assisted in the expansion of the
>> > parquet-cpp library's LogicalType facility.
>> >
>> > My impression is that the original TIMESTAMP converted types were
>> silent on
>> > whether the annotated value was UTC adjusted and that (often arcane)
>> > out-of-band information had to be relied on by readers to decide the UTC
>> > adjustment status for timestamp columns.  It seemed to me that that
>> > perceived shortcoming was a primary motivator for adding the
>> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp
>> > LogicalType.  If that impression is accurate, then when reading
>> TIMESTAMP
>> > columns written by legacy (converted type only) writers, it seems
>> > inappropriate for LogicalType aware readers to unconditionally assign
>> > *either* "false" or "true" (as currently required) to a boolean UTC
>> > adjusted parameter, as that requires the reader to infer a property that
>> > wasn't implied by the writer.
>> >
>> > One possible approach to untangling this might be to amend the
>> > parquet.thrift specification to change the isAdjustedToUTC boolean
>> property
>> > to an enum or union type (some enumerated list) named (for example)
>> > UTCAdjustment with three possible values: Unknown, UTCAdjusted,
>> > NotUTCAdjusted (I'm not married to the names).  Extant files with
>> TIMESTAMP
>> > converted types only would map for forward compatibility to Timestamp
>> > LogicalTypes with UTCAdjustment:=Unknown .  New files with user supplied
>> > Timestamp LogicalTypes would always record the converted type as
>> TIMESTAMP
>> > for backward compatibility regardless of the value of the new
>> UTCAdjustment
>> > parameter (this would be lossy on a round-trip through a legacy library,
>> > but that's unavoidable -- and the legacy libraries would be no worse off
>> > than they are now).  The specification would normatively state that new
>> > user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either
>> > UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in new
>> > files).
>> >
>> > Thanks, Tim
>> >
>>
>

Reply via email to