[
https://issues.apache.org/jira/browse/DRILL-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Turton reassigned DRILL-8099:
-----------------------------------
Assignee: James Turton
> Parquet does not convert from Drill local timestamp to UTC
> ----------------------------------------------------------
>
> Key: DRILL-8099
> URL: https://issues.apache.org/jira/browse/DRILL-8099
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.19.0
> Reporter: Paul Rogers
> Assignee: James Turton
> Priority: Major
>
> Drill follows the old SQL engine convention to store the `TIMESTAMP` type in
> the local time zone. This is, of course, highly awkward in today's age when
> UTC is used as the standard timestamp in most products. However, it is how
> Drill works. (It would be great to add a `UTC_TIMESTAMP` type, but that is
> another topic.)
> Each reader or writer that works with files that hold UTC timestamps must
> convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise,
> Drill works correctly only when the server time zone is set to UTC.
> Now, perhaps we can convince most shops to run their Drill server in UTC, or
> at least set the JVM timezone to UTC. However, this still leaves developers in
> the lurch: if the development machine's timezone is not UTC, then some tests
> fail. In particular:
> {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASParquet}}
> The reason that the above test fails is that the generated Parquet writer
> code assumes (incorrectly) that the Drill timestamp is in UTC and so no
> conversion is needed to write that data into Parquet. In particular, in
> {{ParquetOutputRecordWriter.getNewTimeStampConverter()}}:
> {code:java}
> reader.read(holder);
> consumer.addLong(holder.value);
> {code}
> Basically, it takes a {{LocalDateTime}} and formats it as a UTC timestamp
> (using the "Z" suffix). This is only valid if the machine is in the UTC time
> zone, which is why the test for this class attempts to force the local time
> zone to UTC, something that most users will not do.
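> A corrected writer would instead reinterpret the Drill value as local wall-clock
> time and shift it to a true UTC instant before handing it to Parquet. A minimal
> sketch of that conversion with {{java.time}} (the helper class and method names
> are illustrative, not Drill's actual code):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampWriteFix {
  // Drill's TIMESTAMP long holds local wall-clock time encoded as if it were
  // UTC epoch millis. Before writing an isAdjustedToUTC=true Parquet column,
  // reinterpret that wall clock in the server's zone and emit the true UTC
  // instant. (Hypothetical helper, not Drill's actual code.)
  public static long localMillisToUtcMillis(long drillMillis, ZoneId zone) {
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(drillMillis), ZoneOffset.UTC);
    return wallClock.atZone(zone).toInstant().toEpochMilli();
  }
}
```

> In the Pacific zone (UTC-8 in winter), a Drill value of 0 (wall clock
> 1970-01-01T00:00) becomes 28800000 ms, the correct UTC instant.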
> A consequence of this bug is that "round trip" CTAS will change dates by the
> UTC offset of the machine running the CTAS. In the Pacific time zone, each
> "round trip" subtracts 8 hours from the time. After three round trips, the
> "UTC" date in the Parquet file or JSON will be a day earlier than the
> original data. One might argue that this "feature" is not always helpful.
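> The drift can be modeled with plain {{java.time}}: a value written unchanged but
> labeled UTC, then read back with a correct UTC-to-local conversion, loses the
> zone offset on every cycle. A sketch of one such cycle (hypothetical model of
> the bug, assuming a correct read path paired with the broken write path):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class RoundTripDrift {
  // One broken CTAS round trip: the writer emits the local value unchanged
  // while claiming it is UTC; a correct reader then shifts it from UTC to
  // local wall-clock time, subtracting the zone offset.
  public static long brokenRoundTrip(long drillLocalMillis, ZoneId zone) {
    long claimedUtc = drillLocalMillis; // write side: no conversion
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(claimedUtc), zone);
    return wallClock.toInstant(ZoneOffset.UTC).toEpochMilli(); // read side
  }
}
```

> Starting from 0 in America/Los_Angeles (UTC-8 in winter), three round trips
> yield -86400000 ms: exactly one day earlier, as described above.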
> This issue probably relates to a fix for DRILL-4203 and DRILL-6242.
> The test `TestNestedDateTimeTimestamp` even explains how Drill treats Parquet
> incorrectly, then proceeds to verify the incorrect behavior. The test
> {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASJson}} has been disabled
> pending a fix for this issue, because it combines the broken Parquet semantics
> with the corrected (see DRILL-8100) semantics, leading to a horrible mess.
> See also {{TestDrillParquetReader.hiveTimestampArray()}} which "cheats" by
> mocking the machine timezone to UTC. That is, it states, boldly, that the
> Parquet Hive time support _will not work_ unless the machine running Drill is
> in the UTC timezone. This is a bug, not a feature. This test has been
> disabled as a result.
> h3. Parquet Timestamp Write
> Timestamp handling is quite confusing. Let's start by establishing the form
> of timestamp that Drill should write to Parquet.
> Drill uses the deprecated Parquet {{OriginalType}} to describe types, and
> uses {{TIMESTAMP_MILLIS}} to describe a timestamp. In {{ParquetTypeHelper}}:
> {code:java}
> originalTypeMap.put(MinorType.TIMESTAMP, OriginalType.TIMESTAMP_MILLIS);
> {code}
> The {{OriginalType}} is converted, within Parquet, to the preferred
> {{LogicalTypeAnnotation}} class:
> {code:java}
> case TIMESTAMP_MILLIS:
>   return timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
> {code}
> Where the declaration is:
> {code:java}
> public static TimestampLogicalTypeAnnotation timestampType(
>     final boolean isAdjustedToUTC, final TimeUnit unit) {
> {code}
> This means that the data written to Parquet should be "adjusted to UTC", that
> is, in the UTC timezone. Thus, Drill must convert from the "Drill timestamp"
> in the local time zone to UTC when writing to Parquet.
> h3. Parquet Timestamp Read
> On the read side, the class {{ParquetToDrillTypeConverter}} maps Parquet
> types to Drill types:
> {code:java}
> case TIMESTAMP_MILLIS:
> case TIMESTAMP_MICROS:
>   return TypeProtos.MinorType.TIMESTAMP;
> {code}
> The readers in {{NullableFixedByteAlignedReaders}} read these values. Here we see a bug:
> {code:java}
> static class NullableDictionaryTimeStampReader
>     extends NullableColumnReader<NullableTimeStampVector> {
>
>   @Override
>   protected void readField(long recordsToReadInThisPass) {
>     ValuesReader valReader = usingDictionary
>         ? pageReader.getDictionaryValueReader() : pageReader.getValueReader();
>     for (int i = 0; i < recordsToReadInThisPass; i++) {
>       valueVec.getMutator().setSafe(valuesReadInCurrentPass + i, valReader.readLong());
>     }
>     advanceWriterIndex();
>   }
> }
> {code}
> Here we see that we read the value from Parquet and write it, _without
> conversion_, into the Drill Timestamp vector. That is, when reading, we
> assume that the Parquet data is in the local time zone. We completely ignore
> the {{isAdjustedToUTC}} attribute that we asked to be set. Hence, we claim to
> write UTC, but read that nominally-UTC data back as local time.
> The {{isAdjustedToUTC}} attribute in Parquet may have been added after the
> original Drill code was written. As a result, Drill assumes local time,
> ignoring whether the data is in UTC or not.
> To recover the UTC attribute, we must consult the {{SchemaElement}} passed
> into {{NullableDictionaryTimeStampReader}}. This is a Thrift type with the
> required information in its {{LogicalType logicalType}} field; the
> {{TimestampType}} subclass provides the required information.
> All of this is further complicated by the Parquet INT96 timestamp type which
> seeks to emulate Impala/Hive. That timestamp can evidently either be "local"
> (no time zone) or UTC. Drill seems to handle this case exactly backwards in
> {{ParquetReaderUtility}}:
> {code:java}
> if (retainLocalTimezone) {
>   return DateTimeZone.getDefault().convertUTCToLocal(dateTime);
> } else {
>   return dateTime;
> }
> {code}
> If the data is in the "local" (unknown) timezone, then we should assume that
> it is in Drill's local timezone, and we should do _no conversion_. If,
> however, the time is UTC, then we must convert from UTC to local time before
> storing the data in Drill's Timestamp vector.
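> Sketched with {{java.time}} rather than Joda, the corrected read-side logic
> would branch on the UTC flag, leaving local data untouched and shifting
> genuinely-UTC data into the wall-clock form that Drill's Timestamp vector
> expects (illustrative helper, not Drill's actual code):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampReadFix {
  // Corrected read path: local (unannotated) data needs no conversion; data
  // marked isAdjustedToUTC must be shifted from UTC into the local wall clock
  // that Drill's TIMESTAMP vector stores. (Hypothetical helper.)
  public static long toDrillMillis(long parquetMillis, boolean isAdjustedToUtc,
      ZoneId zone) {
    if (!isAdjustedToUtc) {
      return parquetMillis; // already local wall-clock time: no conversion
    }
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(parquetMillis), zone);
    return wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
  }
}
```

> For example, a UTC value of 28800000 ms (1970-01-01T08:00Z) read in the
> Pacific zone becomes 0, the wall clock 1970-01-01T00:00.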
> h3. Metadata-based Pruning
> Parquet is made considerably more complex by the use of metadata: the planner
> prunes the scan before the scan operator even starts. Thus, errors in local
> vs. UTC occur in this phase as well. Modifying the writer to write UTC causes
> the filter pruning to fail, presumably because that pruning code assumes that
> the dates in Parquet are local. To see this, run
> {{TestCastFunctions.testCastTimestampLiteralInFilter}}. If the data written
> to Parquet is corrected to be UTC, then the Parquet reader is asked to read 0
> records, though it should read the single record in the file.
> h3. Recommendation
> Divide the problem into three parts:
> # Determine if this is a problem. Perhaps users don't store timestamps? Or,
> have worked out ways to work around the incorrect behavior? Do we have the
> resources to tackle such a complex issue?
> # Specify the correct behavior. Perhaps we need yet another option (groan)
> to say whether we write Drill Timestamps as local or UTC. (That is, whether
> we set the {{isAdjustedToUTC}} attribute.) We can always write local (with
> the attribute set to {{false}}), always write UTC (with the attribute set to
> {{true}}), or let the user choose. What we can't do is claim to write UTC but
> actually write local time as we do today.
> # Determine user impact. Users who have learned to work around the current
> incorrect behavior may encounter problems when we switch to the correct
> behavior.
> Then, to implement:
> # Clearly state the rules: which parts of Drill use local time (the
> Timestamp vector) and which use UTC.
> # Specify the preferred conversions to/from UTC and Drill timestamps.
> (Currently every bit of code does it its own way.)
> # Identify in detail what Parquet, and its metadata pruning feature, does
> today.
> # Identify changes required.
> # Implement the changes.
> # Provide a robust set of unit tests that verify the intended behavior.
> Resist the temptation to fudge as current tests do, by forcing the timezone
> to UTC.
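> Such a test can assert, under any JVM zone, that a correct write conversion
> followed by a correct read conversion is the identity (away from DST
> transitions), with no need to force the zone to UTC. A self-contained sketch
> with hypothetical helper names:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class RoundTripCheck {
  // Write side: Drill local wall clock -> true UTC instant. (Hypothetical.)
  public static long localToUtc(long millis, ZoneId zone) {
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneOffset.UTC)
        .atZone(zone).toInstant().toEpochMilli();
  }

  // Read side: UTC instant -> Drill local wall clock.
  public static long utcToLocal(long millis, ZoneId zone) {
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), zone)
        .toInstant(ZoneOffset.UTC).toEpochMilli();
  }

  // Identity away from DST gaps and overlaps, regardless of the JVM zone.
  public static boolean roundTripIsIdentity(long millis, ZoneId zone) {
    return utcToLocal(localToUtc(millis, zone), zone) == millis;
  }
}
```

> Running the identity check over several zones catches the class of bug
> described here without ever mocking the machine timezone.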
> The team may find that the easiest solution is to implement a UTC-based
> Timestamp type and simply eliminate all the questions of conversion. Such a
> move would help with another issue with local times: during the daylight
> saving time transitions, the mapping between UTC and local times is
> non-uniform. Either one local hour is skipped, or two UTC times map to the
> same local time. Thus, during those transitions, operations on local times
> (the difference between two times, say) will differ from the same operations
> on UTC times.
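> The fall-back transition makes the ambiguity concrete: two distinct UTC
> instants render as the same local wall-clock time. A small sketch (the
> example uses the 2021-11-07 transition in America/Los_Angeles):

```java
import java.time.Instant;
import java.time.ZoneId;

public class DstAmbiguity {
  // Render a UTC instant as the local wall-clock time in the given zone.
  public static String localWallClock(String utcIso, ZoneId zone) {
    return Instant.parse(utcIso).atZone(zone).toLocalDateTime().toString();
  }
}
```

> Both 2021-11-07T08:30:00Z (still PDT) and 2021-11-07T09:30:00Z (after the
> fall-back to PST) render as the local time 2021-11-07T01:30, so a
> local-time difference between them is zero while the UTC difference is an hour.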
--
This message was sent by Atlassian Jira
(v8.20.1#820001)