Hi Zoltan,

Thank you for your engagement on this issue, it's much appreciated.

The upcoming release of the parquet-cpp library takes essentially the
approach you outline.  When a Timestamp logical type is created a from a
legacy converted TIMESTAMP_* value (your third case), the Timestamp's
UTC-normalized property is set to "true" (per the backward compatibility
requirements in the spec) but an additional property indicating that the
LogicalType was derived from a ConvertedType is also set to "true" and
exposed via an accessor.  So a library user can choose to accept the
default UTC normalization inference or override it by checking the new flag.

Similarly, for forward compatibility, the default behavior is to produce a
TIMESTAMP_* converted type only if the Timestamp logical type is UTC
normalized (again conforming to the spec), but the Timestamp LogicalType
API allows the library user to force a TIMESTAMP converted type even for
non-UTC normalized Timestamps.

Given these API options, I tend to agree with you that the need for file
format changes is substantially reduced or eliminated.

Thanks,
Tim

On Thu, Jul 11, 2019 at 6:41 AM Zoltan Ivanfi <[email protected]>
wrote:

> Hi Tim,
>
> I was just told that arrow 0.14.0 is already out and uses the new
> logical types. This means that your original suggestion of changing
> the boolean to an enum is not really a viable option. Regarding your
> second suggestion, an additional field may be added to parquet-format,
> but I'm still concerned that it would be confusing and that it would
> provide little value in practice. However, thinking more about the
> problem I realized that we do not have to add anything to
> parquet-format to record the uncertainty of timestamp semantics, since
> we already have a way for doing so by using the legacy timestamp
> types. It is enough to expose the possiblity of having undefined
> semantics in the API. This can be deduced from the logical types
> stored in the file in the following way:
>
> Logical type in file          =>  Logical type in API
> -----------------------------------------------------------
> new TS type with semantics A  =>  TS with semantics A
> new TS type with semantics B  =>  TS with semantics B
> legacy TS type                =>  TS with unknown semantics
>
> For example, TIMESTAMP_MICROS would map to unknown semantics in the
> API while TIMESTAMP(isAdjustedToUTC=true, unit=MICROS) would map to
> instant semantics. Please let me know your thoughts on this
> suggestion.
>
> Thanks,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 6:43 PM TP Boudreau <[email protected]> wrote:
> >
> > Sorry for the quick self-reply, but after brief reflection I think two
> > changes to my alternative proposal are required:
> >
> > 1.  The proposed new field should be a parameter to the TimestampType,
> not
> > FileMetaData -- file level adds unnecessary complication / opportunities
> > for mischief.
> > 2.  Although reported vs. inferred is the logical distinction,
> practically
> > this change is about whether or not the TimestampType was built from a
> > TIMESTAMP converted type, so the name should reflect that.
> >
> > After these amendments, this option boils down to: add a new boolean
> > parameter to the TimestampType named (something like) fromConvertedType.
> >
> > --TPB
> >
> > On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau <[email protected]>
> wrote:
> >
> > > Hi Zoltan,
> > >
> > > Thank you for the helpful clarification of the community's
> understanding
> > > of the TIMESTAMP annotation.
> > >
> > > The core of the problem (IMHO) is that there no way to distinguish in
> the
> > > new LogicalType TimestampType between the case where UTC-normalization
> has
> > > been directly reported (via a user supplied TimestampType boolean
> > > parameter) or merely inferred (from a naked TIMESTAMP converted
> type).  So
> > > perhaps another alternative might be to retain the isAdjustedToUTC
> boolean
> > > as is and add another field that indicates whether the adjusted flag
> was
> > > REPORTED or INFERRED (could be boolean or short variant or some type).
> > > This would allow interested readers to differentiate between old and
> new
> > > timestamps, while allowing other readers to enjoy the default you
> believe
> > > is warranted.  It seems most straightforward that it would be an
> additional
> > > parameter on the TimestampType, but I supposed it could reside in the
> > > FileMetaData struct (on the assumption that the schema elements, having
> > > been written by the same writer, all uniformly use converted type or
> > > LogicalType).
> > >
> > > --Tim
> > >
> > >
> > > On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi <[email protected]
> >
> > > wrote:
> > >
> > >> Hi Tim,
> > >>
> > >> In my opinion the specification of the older timestamp types only
> allowed
> > >> UTC-normalized storage, since these types were defined as the number
> of
> > >> milli/microseconds elapsed since the Unix epoch. This clearly defines
> the
> > >> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e.
> > >> 1970-01-01 00:00:00 UTC. It does not say anything about how this value
> > >> must
> > >> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC",
> but
> > >> typically it is displayed adjusted to the user's local timezone, for
> > >> example "1970-01-01 01:00:00" for a user in Paris. I don't think this
> > >> definition allows interpreting the numeric value 0 as "1970-01-01
> > >> 00:00:00"
> > >> in Paris, since the latter would correspond to 1969-12-31 23:00:00
> UTC,
> > >> which must be stored as the numeric value -3600 (times 10^3 for
> _MILLIS or
> > >> 10^6 for _MICROS) instead.
> > >>
> > >> I realize that compatibility with real-life usage patterns is
> important
> > >> regardless of whether they comply with the specification or not, but I
> > >> can't think of any solution that would be useful in practice. The
> > >> suggestion to turn the boolean into an enum would certainly allow
> Parquet
> > >> to have timestamps with unknown semantics, but I don't know what value
> > >> that
> > >> would bring to applications and how they would use it. I'm also afraid
> > >> that
> > >> the undefined semantics would get misused/overused by developers who
> are
> > >> not sure about the difference between the two semantics and we would
> end
> > >> up
> > >> with a lot of meaningless timestamps.
> > >>
> > >> Even with the problems I listed your suggestion may still be better
> than
> > >> the current solution, but before making a community decision I would
> like
> > >> to continue this discussion focusing on three questions:
> > >>
> > >>    - What are the implications of this change?
> > >>    - How will unknown semantics be used in practice?
> > >>    - Does it bring value?
> > >>    - Can we do better?
> > >>    - Can we even change the boolean to an enum? It has been specified
> like
> > >>    that and released a long time ago. Although I am not aware of any
> > >> software
> > >>    component that would have already implemented it, I was also
> unaware of
> > >>    software components using TIMESTAMP_MILLIS and _MICROS for local
> > >> semantics.
> > >>
> > >> One alternative that comes to my mind is to default to the more common
> > >> UTC-normalized semantics but allow overriding it in the reader schema.
> > >>
> > >> Thanks,
> > >>
> > >> Zoltan
> > >>
> > >> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau <[email protected]>
> wrote:
> > >>
> > >> > I'm not a long-time Parquet user, but I assisted in the expansion
> of the
> > >> > parquet-cpp library's LogicalType facility.
> > >> >
> > >> > My impression is that the original TIMESTAMP converted types were
> > >> silent on
> > >> > whether the annotated value was UTC adjusted and that (often arcane)
> > >> > out-of-band information had to be relied on by readers to decide
> the UTC
> > >> > adjustment status for timestamp columns.  It seemed to me that that
> > >> > perceived shortcoming was a primary motivator for adding the
> > >> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp
> > >> > LogicalType.  If that impression is accurate, then when reading
> > >> TIMESTAMP
> > >> > columns written by legacy (converted type only) writers, it seems
> > >> > inappropriate for LogicalType aware readers to unconditionally
> assign
> > >> > *either* "false" or "true" (as currently required) to a boolean UTC
> > >> > adjusted parameter, as that requires the reader to infer a property
> that
> > >> > wasn't implied by the writer.
> > >> >
> > >> > One possible approach to untangling this might be to amend the
> > >> > parquet.thrift specification to change the isAdjustedToUTC boolean
> > >> property
> > >> > to an enum or union type (some enumerated list) named (for example)
> > >> > UTCAdjustment with three possible values: Unknown, UTCAdjusted,
> > >> > NotUTCAdjusted (I'm not married to the names).  Extant files with
> > >> TIMESTAMP
> > >> > converted types only would map for forward compatibility to
> Timestamp
> > >> > LogicalTypes with UTCAdjustment:=Unknown .  New files with user
> supplied
> > >> > Timestamp LogicalTypes would always record the converted type as
> > >> TIMESTAMP
> > >> > for backward compatibility regardless of the value of the new
> > >> UTCAdjustment
> > >> > parameter (this would be lossy on a round-trip through a legacy
> library,
> > >> > but that's unavoidable -- and the legacy libraries would be no
> worse off
> > >> > than they are now).  The specification would normatively state that
> new
> > >> > user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either
> > >> > UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in
> new
> > >> > files).
> > >> >
> > >> > Thanks, Tim
> > >> >
> > >>
> > >
>

Reply via email to