Hi,

Oh, and one more thing: Before releasing the next Arrow version
incorporating the new logical types, we should definitely test that their
behaviour matches that of parquet-mr. When is the next release planned to
come out?

Br,

Zoltan

On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi Wes,
>
> Yes, I agree that we should do that, but then we have a problem of what to
> do in the other direction, i.e. when we use the new logical types API to
> read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> normalized flag? Tim has started a discussion about that, suggesting to use
> three states that I just answered.
>
> Br,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Thank for the comments.
>>
>> So in summary I think that we need to set the TIMESTAMP_* converted
>> types to maintain forward compatibility and stay consistent with what
>> we were doing in the C++ library prior to the introduction of the
>> LogicalType metadata.
>>
>> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <z...@cloudera.com.invalid>
>> wrote:
>> >
>> > Hi Wes,
>> >
>> > Both of the semantics are deterministic in one aspect and
>> indeterministic
>> > in another. Timestamps of instant semantic will always refer to the same
>> > instant, but their user-facing representation (how they get displayed)
>> > depends on the user's time zone. Timestamps of local semantics always
>> have
>> > the same user-facing representation but the instant they refer to is
>> > undefined (or ambigous, depending on your point of view).
>> >
>> > My understanding is that Spark uses instant semantics, i.e., timestamps
>> are
>> > stored normalized to UTC and are displayed adjusted to the user's local
>> > time zone.
>> >
>> > Br,
>> >
>> > Zoltan
>> >
>> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> >
>> > > Thanks Zoltan.
>> > >
>> > > This is definitely a tricky issue.
>> > >
>> > > Spark's application of localtime semantics to timestamp data has been
>> > > a source of issues for many people. Personally I don't find that
>> > > behavior to be particularly helpful since depending on the session
>> > > time zone, you will get different results on data not marked as
>> > > UTC-normalized.
>> > >
>> > > In pandas, as contrast, we apply UTC semantics to
>> > > naive/not-explicitly-normalized data so at least the code produces
>> > > deterministic results on all environments. That seems strictly better
>> > > to me -- if you want a localized interpretation of naive data, you
>> > > must opt into this rather than having it automatically selected based
>> > > on your locale. The instances of people shooting their toes off due to
>> > > time zones are practically non-existent, whereas I'm hearing about
>> > > Spark gotchas all the time.
>> > >
>> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi <z...@cloudera.com.invalid
>> >
>> > > wrote:
>> > > >
>> > > > Hi Wes,
>> > > >
>> > > > The rules for TIMESTAMP forward-compatibility were created based on
>> the
>> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
>> been used
>> > > > in the instant aka. UTC-normalized semantics so far. This
>> assumption was
>> > > > supported by two sources:
>> > > >
>> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and
>> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since
>> the
>> > > Unix
>> > > > epoch, an instant specified in UTC, from which it follows that they
>> have
>> > > > instant semantics (because timestamps of local semantics do not
>> > > correspond
>> > > > to a single instant).
>> > > >
>> > > > 2. Anecdotal knowledge: We were not aware of any software component
>> that
>> > > > used these types differently from the specification.
>> > > >
>> > > > Based on your e-mail, we were wrong on #2.
>> > > >
>> > > > From this false premise it followed that TIMESTAMPs with local
>> semantics
>> > > > were a new type and did not need to be annotated with the old types
>> to
>> > > > maintain compatibility. In fact, annotating them with the old types
>> were
>> > > > considered to be harmful, since it would have mislead older readers
>> into
>> > > > thinking that they can read TIMESTAMPs with local semantics, when in
>> > > > reality they would have misinterpreted them as TIMESTAMPs with
>> instant
>> > > > semantics. This would have lead to a difference of several hours,
>> > > > corresponding to the time zone offset.
>> > > >
>> > > > In the light of your e-mail, this misinterpretation of timestamps
>> may
>> > > > already be happening, since if Arrow annotates local timestamps with
>> > > > TIMESTAMP_MILLIS or TIMESTMAP_MICROS, Spark probably misinterprets
>> them
>> > > as
>> > > > timestamps with instant semantics, leading to a difference of
>> several
>> > > hours.
>> > > >
>> > > > Based on this, I think it would make sense from Arrow's point of
>> view to
>> > > > annotate both semantics with the old types, since that is its
>> historical
>> > > > behaviour and keeping it up is needed for maintaining compatibilty.
>> I'm
>> > > not
>> > > > so sure about the Java library though, since as far as I know, these
>> > > types
>> > > > were never used in the local sense there (although I may be wrong
>> again).
>> > > > Were we to decide that Arrow and parquet-mr should behave
>> differently in
>> > > > this aspect though, it may be tricky to convey this distinction in
>> the
>> > > > specification. I would be interested in hearing your and other
>> > > developers'
>> > > > opinions on this.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Zoltan
>> > > >
>> > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > > >
>> > > > > hi folks,
>> > > > >
>> > > > > We have just recently implemented the new LogicalType unions in
>> the
>> > > > > Parquet C++ library and we have run into a forward compatibility
>> > > > > problem with reader versions prior to this implementation.
>> > > > >
>> > > > > To recap the issue, prior to the introduction of LogicalType, the
>> > > > > Parquet format had no explicit notion of time zones or UTC
>> > > > > normalization. The new TimestampType provides a flag to indicate
>> > > > > UTC-normalization
>> > > > >
>> > > > > struct TimestampType {
>> > > > > 1: required bool isAdjustedToUTC
>> > > > > 2: required TimeUnit unit
>> > > > > }
>> > > > >
>> > > > > When using this new type, the ConvertedType field must also be
>> set for
>> > > > > forward compatibility (so that old readers can still understand
>> the
>> > > > > data), but parquet.thrift says
>> > > > >
>> > > > > // use ConvertedType TIMESTAMP_MICROS for
>> TIMESTAMP(isAdjustedToUTC =
>> > > > > true, unit = MICROS)
>> > > > > // use ConvertedType TIMESTAMP_MILLIS for
>> TIMESTAMP(isAdjustedToUTC =
>> > > > > true, unit = MILLIS)
>> > > > > 8: TimestampType TIMESTAMP
>> > > > >
>> > > > > In Apache Arrow, we have 2 varieties of timestamps:
>> > > > >
>> > > > > * Timestamp without time zone (no UTC normalization indicated)
>> > > > > * Timestamp with time zone (values UTC-normalized)
>> > > > >
>> > > > > Prior to the introduction of LogicalType, we would set either
>> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC
>> > > > > normalization. So when reading the data back, any notion of
>> having had
>> > > > > a time zone is lost (it could be stored in schema metadata if
>> > > > > desired).
>> > > > >
>> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_ when
>> > > > > isAdjustedToUTC is true creates a forward compatibility break in
>> this
>> > > > > regard. This was reported to us shortly after releasing Apache
>> Arrow
>> > > > > 0.14.0:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/ARROW-5878
>> > > > >
>> > > > > We are discussing setting the ConvertedType unconditionally in
>> > > > >
>> > > > > https://github.com/apache/arrow/pull/4825
>> > > > >
>> > > > > This might need to be a setting that is toggled when data is
>> coming
>> > > > > from Arrow, but I wonder if the text in parquet.thrift is the
>> intended
>> > > > > forward compatibility interpretation, and if not should we amend.
>> > > > >
>> > > > > Thanks,
>> > > > > Wes
>> > > > >
>> > >
>>
>

Reply via email to