[
https://issues.apache.org/jira/browse/DRILL-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Turton reassigned DRILL-8099:
-----------------------------------
Assignee: James Turton
> Parquet does not convert from Drill local timestamp to UTC
> ----------------------------------------------------------
>
> Key: DRILL-8099
> URL: https://issues.apache.org/jira/browse/DRILL-8099
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.19.0
> Reporter: Paul Rogers
> Assignee: James Turton
> Priority: Major
>
> Drill follows the old SQL engine convention to store the `TIMESTAMP` type in
> the local time zone. This is, of course, highly awkward in today's age when
> UTC is used as the standard timestamp in most products. However, it is how
> Drill works. (It would be great to add a `UTC_TIMESTAMP` type, but that is
> another topic.)
> Each reader or writer that works with files that hold UTC timestamps must
> convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise,
> Drill works correctly only when the server time zone is set to UTC.
> Now, perhaps we can convince most shops to run their Drill server in UTC, or
> at least set the JVM timezone to UTC. However, this still leaves developers in
> the lurch: if the development machine's timezone is not UTC, then some tests
> fail. In particular:
> {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASParquet}}
> The reason that the above test fails is that the generated Parquet writer
> code assumes (incorrectly) that the Drill timestamp is in UTC and so no
> conversion is needed to write that data into Parquet. In particular, in
> {{ParquetOutputRecordWriter.getNewTimeStampConverter()}}:
> {code:java}
> reader.read(holder);
> consumer.addLong(holder.value);
> {code}
> Basically, it takes a {{LocalDateTime}} and formats it as a UTC timestamp
> (using the "Z" suffix). This is only valid if the machine is in the UTC time
> zone, which is why the test for this class attempts to force the local time
> zone to UTC, something that most users will not do.
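> A corrected writer would instead reinterpret the Drill value as local wall-clock
> time and shift it to a true UTC instant before handing it to Parquet. A minimal
> sketch of that conversion with {{java.time}} (the helper class and method names
> are illustrative, not Drill's actual code):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampWriteFix {
  // Drill's TIMESTAMP long holds local wall-clock time encoded as if it were
  // UTC epoch millis. Before writing an isAdjustedToUTC=true Parquet column,
  // reinterpret that wall clock in the server's zone and emit the true UTC
  // instant. (Hypothetical helper, not Drill's actual code.)
  public static long localMillisToUtcMillis(long drillMillis, ZoneId zone) {
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(drillMillis), ZoneOffset.UTC);
    return wallClock.atZone(zone).toInstant().toEpochMilli();
  }
}
```

> In the Pacific zone (UTC-8 in winter), a Drill value of 0 (wall clock
> 1970-01-01T00:00) becomes 28800000 ms, the correct UTC instant.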
> A consequence of this bug is that "round trip" CTAS will change dates by the
> UTC offset of the machine running the CTAS. In the Pacific time zone, each
> "round trip" subtracts 8 hours from the time. After three round trips, the
> "UTC" date in the Parquet file or JSON will be a day earlier than the
> original data. One might argue that this "feature" is not always helpful.
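> The drift can be modeled with plain {{java.time}}: a value written unchanged but
> labeled UTC, then read back with a correct UTC-to-local conversion, loses the
> zone offset on every cycle. A sketch of one such cycle (hypothetical model of
> the bug, assuming a correct read path paired with the broken write path):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class RoundTripDrift {
  // One broken CTAS round trip: the writer emits the local value unchanged
  // while claiming it is UTC; a correct reader then shifts it from UTC to
  // local wall-clock time, subtracting the zone offset.
  public static long brokenRoundTrip(long drillLocalMillis, ZoneId zone) {
    long claimedUtc = drillLocalMillis; // write side: no conversion
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(claimedUtc), zone);
    return wallClock.toInstant(ZoneOffset.UTC).toEpochMilli(); // read side
  }
}
```

> Starting from 0 in America/Los_Angeles (UTC-8 in winter), three round trips
> yield -86400000 ms: exactly one day earlier, as described above.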
> This issue probably relates to a fix for DRILL-4203 and DRILL-6242.
> The test `TestNestedDateTimeTimestamp` even explains how Drill treats Parquet
> incorrectly, then proceeds to verify the incorrect behavior. The test
> {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASJson}} has been disabled
> pending a fix for this issue, because it combines the broken Parquet semantics
> with the corrected (see DRILL-8100) semantics, leading to a horrible mess.
> See also {{TestDrillParquetReader.hiveTimestampArray()}} which "cheats" by
> mocking the machine timezone to UTC. That is, it states, boldly, that the
> Parquet Hive time support _will not work_ unless the machine running Drill is
> in the UTC timezone. This is a bug, not a feature. This test has been
> disabled as a result.
> h3. Parquet Timestamp Write
> Timestamp handling is quite confusing. Let's start by establishing the form
> of timestamp that Drill should write to Parquet.
> Drill uses the deprecated Parquet {{OriginalType}} to describe types, and
> uses {{TIMESTAMP_MILLIS}} to describe a timestamp. In {{ParquetTypeHelper}}:
> {code:java}
> originalTypeMap.put(MinorType.TIMESTAMP, OriginalType.TIMESTAMP_MILLIS);
> {code}
> The {{OriginalType}} is converted, within Parquet, to the preferred
> {{LogicalTypeAnnotation}} class:
> {code:java}
> case TIMESTAMP_MILLIS:
>   return timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
> {code}
> Where the declaration is:
> {code:java}
> public static TimestampLogicalTypeAnnotation timestampType(
>     final boolean isAdjustedToUTC, final TimeUnit unit) {
> {code}
> This means that the data written to Parquet should be "adjusted to UTC", that
> is, in the UTC timezone. Thus, Drill must convert from the "Drill timestamp"
> in the local time zone to UTC when writing to Parquet.
> h3. Parquet Timestamp Read
> On the read side, the class {{ParquetToDrillTypeConverter}} maps Parquet
> types to Drill types:
> {code:java}
> case TIMESTAMP_MILLIS:
> case TIMESTAMP_MICROS:
>   return TypeProtos.MinorType.TIMESTAMP;
> {code}
> The readers in {{NullableFixedByteAlignedReaders}} read these values. Here we see a bug:
> {code:java}
> static class NullableDictionaryTimeStampReader
>     extends NullableColumnReader<NullableTimeStampVector> {
>
>   @Override
>   protected void readField(long recordsToReadInThisPass) {
>     ValuesReader valReader = usingDictionary
>         ? pageReader.getDictionaryValueReader() : pageReader.getValueReader();
>     for (int i = 0; i < recordsToReadInThisPass; i++) {
>       valueVec.getMutator().setSafe(valuesReadInCurrentPass + i, valReader.readLong());
>     }
>     advanceWriterIndex();
>   }
> }
> {code}
> Here we see that we read the value from Parquet and write it, _without
> conversion_, into the Drill Timestamp vector. That is, when reading, we
> assume that the Parquet data is in the local time zone. We completely ignore
> the {{isAdjustedToUTC}} attribute that we asked to be set. Hence, we claim to
> write UTC, but read that nominally-UTC data back as local time.
> The {{isAdjustedToUTC}} attribute in Parquet may have been added after the
> original Drill code was written. As a result, Drill assumes local time,
> ignoring whether the data is in UTC or not.
> To recover the UTC attribute, we must consult the {{SchemaElement}} passed
> into {{NullableDictionaryTimeStampReader}}. This is a Thrift type with the
> required information in its {{LogicalType logicalType}} field; the
> {{TimestampType}} subclass provides the required information.
> All of this is further complicated by the Parquet INT96 timestamp type which
> seeks to emulate Impala/Hive. That timestamp can evidently either be "local"
> (no time zone) or UTC. Drill seems to handle this case exactly backwards in
> {{ParquetReaderUtility}}:
> {code:java}
> if (retainLocalTimezone) {
>   return DateTimeZone.getDefault().convertUTCToLocal(dateTime);
> } else {
>   return dateTime;
> }
> {code}
> If the data is in the "local" (unknown) timezone, then we should assume that
> it is in Drill's local timezone, and we should do _no conversion_. If,
> however, the time is UTC, then we must convert from UTC to local time before
> storing the data in Drill's Timestamp vector.
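> Sketched with {{java.time}} rather than Joda, the corrected read-side logic
> would branch on the UTC flag, leaving local data untouched and shifting
> genuinely-UTC data into the wall-clock form that Drill's Timestamp vector
> expects (illustrative helper, not Drill's actual code):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampReadFix {
  // Corrected read path: local (unannotated) data needs no conversion; data
  // marked isAdjustedToUTC must be shifted from UTC into the local wall clock
  // that Drill's TIMESTAMP vector stores. (Hypothetical helper.)
  public static long toDrillMillis(long parquetMillis, boolean isAdjustedToUtc,
      ZoneId zone) {
    if (!isAdjustedToUtc) {
      return parquetMillis; // already local wall-clock time: no conversion
    }
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(parquetMillis), zone);
    return wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
  }
}
```

> For example, a UTC value of 28800000 ms (1970-01-01T08:00Z) read in the
> Pacific zone becomes 0, the wall clock 1970-01-01T00:00.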
> h3. Metadata-based Pruning
> Parquet is made considerably more complex by the use of metadata: the planner
> prunes the scan before the scan operator even starts. Thus, errors in local
> vs. UTC occur in this phase as well. Modifying the writer to write UTC causes
> the filter pruning to fail, presumably because that pruning code assumes that
> the dates in Parquet are local. To see this, run
> {{TestCastFunctions.testCastTimestampLiteralInFilter}}. If the data written
> to Parquet is corrected to be UTC, then the Parquet reader is asked to read 0
> records, though it should read the single record in the file.
> h3. Recommendation
> Divide the problem into three parts:
> # Determine if this is a problem. Perhaps users don't store timestamps? Or,
> have worked out ways to work around the incorrect behavior? Do we have the
> resources to tackle such a complex issue?
> # Specify the correct behavior. Perhaps we need yet another option (groan)
> to say whether we write Drill Timestamps as local or UTC. (That is, whether
> we set the {{isAdjustedToUTC}} attribute.) We can always write local (with
> the attribute set to {{false}}), always write UTC (with the attribute set to
> {{true}}), or let the user choose. What we can't do is claim to write UTC but
> actually write local time as we do today.
> # Determine user impact. Users who have learned to work around the current
> incorrect behavior may encounter problems when we switch to the correct
> behavior.
> Then, to implement:
> # Clearly state the rules: which parts of Drill use local time (the
> Timestamp vector) and which use UTC.
> # Specify the preferred conversions to/from UTC and Drill timestamps.
> (Currently every bit of code does it its own way.)
> # Identify in detail what Parquet, and its metadata pruning feature, does
> today.
> # Identify changes required.
> # Implement the changes.
> # Provide a robust set of unit tests that verify the intended behavior.
> Resist the temptation to fudge as current tests do, by forcing the timezone
> to UTC.
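> Such a test can assert, under any JVM zone, that a correct write conversion
> followed by a correct read conversion is the identity (away from DST
> transitions), with no need to force the zone to UTC. A self-contained sketch
> with hypothetical helper names:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class RoundTripCheck {
  // Write side: Drill local wall clock -> true UTC instant. (Hypothetical.)
  public static long localToUtc(long millis, ZoneId zone) {
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneOffset.UTC)
        .atZone(zone).toInstant().toEpochMilli();
  }

  // Read side: UTC instant -> Drill local wall clock.
  public static long utcToLocal(long millis, ZoneId zone) {
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), zone)
        .toInstant(ZoneOffset.UTC).toEpochMilli();
  }

  // Identity away from DST gaps and overlaps, regardless of the JVM zone.
  public static boolean roundTripIsIdentity(long millis, ZoneId zone) {
    return utcToLocal(localToUtc(millis, zone), zone) == millis;
  }
}
```

> Running the identity check over several zones catches the class of bug
> described here without ever mocking the machine timezone.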
> The team may find that the easiest solution is to implement a UTC-based
> Timestamp type and simply eliminate all the questions of conversion. Such a
> move would help with another issue with local times: during the daylight
> saving time transitions, the mapping between UTC and local times is
> non-uniform. Either one local hour is skipped, or two UTC times map to the
> same local time. Thus, during those transitions, operations on local times
> (the difference between two times, say) will differ from the same operations
> on UTC times.
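> The fall-back transition makes the ambiguity concrete: two distinct UTC
> instants render as the same local wall-clock time. A small sketch (the
> example uses the 2021-11-07 transition in America/Los_Angeles):

```java
import java.time.Instant;
import java.time.ZoneId;

public class DstAmbiguity {
  // Render a UTC instant as the local wall-clock time in the given zone.
  public static String localWallClock(String utcIso, ZoneId zone) {
    return Instant.parse(utcIso).atZone(zone).toLocalDateTime().toString();
  }
}
```

> Both 2021-11-07T08:30:00Z (still PDT) and 2021-11-07T09:30:00Z (after the
> fall-back to PST) render as the local time 2021-11-07T01:30, so a
> local-time difference between them is zero while the UTC difference is an hour.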
--
This message was sent by Atlassian Jira
(v8.20.1#820001)