Hi folks,

We recently implemented the new LogicalType unions in the Parquet C++
library and have run into a forward compatibility problem with reader
versions that predate this implementation.

To recap the issue, prior to the introduction of LogicalType, the
Parquet format had no explicit notion of time zones or UTC
normalization. The new TimestampType provides a flag to indicate
UTC-normalization:

struct TimestampType {
1: required bool isAdjustedToUTC
2: required TimeUnit unit
}

When using this new type, the ConvertedType field must also be set for
forward compatibility (so that old readers can still understand the
data), but parquet.thrift says:

// use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC = true, unit = MICROS)
// use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC = true, unit = MILLIS)
8: TimestampType TIMESTAMP

In Apache Arrow, we have 2 varieties of timestamps:

* Timestamp without time zone (no UTC normalization indicated)
* Timestamp with time zone (values UTC-normalized)

Prior to the introduction of LogicalType, we would set either
TIMESTAMP_MILLIS or TIMESTAMP_MICROS regardless of UTC normalization.
So when reading the data back, any notion of having had a time zone
is lost (it could be stored in schema metadata if desired).

I believe that setting the TIMESTAMP_* ConvertedType _only_ when
isAdjustedToUTC is true creates a forward compatibility break in this
regard. This was reported to us shortly after releasing Apache Arrow
0.14.0:

https://issues.apache.org/jira/browse/ARROW-5878

We are discussing setting the ConvertedType unconditionally in

https://github.com/apache/arrow/pull/4825
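
To make the two interpretations concrete, here is a minimal Python
sketch. It is not the Parquet C++ code; the function converted_type
and its unconditional flag are hypothetical names used only for
illustration. It models a writer choosing which legacy ConvertedType
to emit for a given TimestampType.

```python
from enum import Enum
from typing import NamedTuple, Optional


class TimeUnit(Enum):
    MILLIS = "MILLIS"
    MICROS = "MICROS"


class TimestampType(NamedTuple):
    """Mirrors the two fields of the Thrift struct quoted above."""
    is_adjusted_to_utc: bool
    unit: TimeUnit


# Legacy ConvertedType names that old readers understand.
_LEGACY = {
    TimeUnit.MILLIS: "TIMESTAMP_MILLIS",
    TimeUnit.MICROS: "TIMESTAMP_MICROS",
}


def converted_type(ts: TimestampType, unconditional: bool) -> Optional[str]:
    """Return the legacy ConvertedType to write, or None to leave it unset.

    unconditional=False follows a strict reading of the parquet.thrift
    comment: only UTC-adjusted timestamps get a legacy annotation.
    unconditional=True is the pre-LogicalType behaviour described above.
    """
    if not ts.is_adjusted_to_utc and not unconditional:
        return None
    return _LEGACY[ts.unit]


# A timestamp without time zone, as Arrow would hand it to the writer.
local_micros = TimestampType(is_adjusted_to_utc=False, unit=TimeUnit.MICROS)
utc_millis = TimestampType(is_adjusted_to_utc=True, unit=TimeUnit.MILLIS)

strict = converted_type(local_micros, unconditional=False)
lenient = converted_type(local_micros, unconditional=True)
utc_strict = converted_type(utc_millis, unconditional=False)
```

Under the strict policy the non-UTC column gets no ConvertedType at
all, so a reader that only knows ConvertedType sees no timestamp
annotation on it; under the unconditional policy it still sees
TIMESTAMP_MICROS. UTC-adjusted columns are annotated either way.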

This might need to be a setting that is toggled when data is coming
from Arrow, but I wonder whether the text in parquet.thrift is the
intended forward compatibility interpretation and, if not, whether we
should amend it.

Thanks,
Wes
