[ 
https://issues.apache.org/jira/browse/DRILL-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-8099:
-------------------------------
    Description: 
Drill follows the old SQL engine convention of storing the {{TIMESTAMP}} type in 
the local time zone. This is, of course, highly awkward today, when UTC is 
used as the standard timestamp representation in most products. However, it is how Drill 
works. (It would be great to add a {{UTC_TIMESTAMP}} type, but that is another 
topic.)

Each reader or writer that works with files that hold UTC timestamps must 
convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise, 
Drill works correctly only when the server time zone is set to UTC.
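
To make the two conventions concrete, here is a small, self-contained illustration (not Drill code; the zone is an arbitrary stand-in for the server's zone) of how the same wall-clock time maps to different epoch-millis values depending on whether it is interpreted as UTC or as local time:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Illustration only: one wall-clock time, two epoch-millis encodings.
public class TimestampConventions {
  public static void main(String[] args) {
    ZoneId local = ZoneId.of("America/Los_Angeles"); // stand-in server zone
    LocalDateTime wallClock = LocalDateTime.of(2022, 1, 15, 12, 0);

    // Encode the wall clock as if it were UTC (Drill's "local" convention).
    long drillMillis = wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
    // Encode the wall clock as a true instant in the local zone.
    long utcMillis = wallClock.atZone(local).toInstant().toEpochMilli();

    // In January, Los Angeles is UTC-8, so the encodings differ by 8 hours.
    System.out.println(utcMillis - drillMillis); // 28800000
  }
}
```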

Now, perhaps we can convince most shops to run their Drill server in UTC, or at 
least set the JVM time zone to UTC. However, this still leaves developers in the 
lurch: if the development machine's time zone is not UTC, then some tests fail. In 
particular:

{{TestNestedDateTimeTimestamp.testNestedDateTimeCTASParquet}}

The reason that the above test fails is that the generated Parquet writer code 
assumes (incorrectly) that the Drill timestamp is in UTC and so no conversion 
is needed to write that data into Parquet. In particular, in 
{{ParquetOutputRecordWriter.getNewTimeStampConverter()}}:
{noformat}
    reader.read(holder);
    consumer.addLong(holder.value);
{noformat}
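A fix would convert the local-zone value before handing it to the consumer. A minimal sketch of that conversion (the helper name and standalone class are hypothetical, not Drill's actual code):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical helper: turn Drill's local-zone timestamp millis into the
// UTC millis that Parquet's isAdjustedToUTC annotation promises.
public class LocalToUtc {
  static long localMillisToUtc(long drillMillis, ZoneId serverZone) {
    // Drill's long encodes a wall-clock time as if it were UTC; decode it...
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(drillMillis), ZoneOffset.UTC);
    // ...then anchor that wall clock in the server zone to get the true instant.
    return wallClock.atZone(serverZone).toInstant().toEpochMilli();
  }

  public static void main(String[] args) {
    long noon = LocalDateTime.of(2022, 1, 15, 12, 0)
        .toInstant(ZoneOffset.UTC).toEpochMilli();
    // Noon in Los Angeles (UTC-8 in January) is 20:00 UTC: 8 hours later.
    System.out.println(
        localMillisToUtc(noon, ZoneId.of("America/Los_Angeles")) - noon); // 28800000
  }
}
```

The generated converter would then call {{consumer.addLong(localMillisToUtc(holder.value, zone))}} rather than passing the raw value through.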
The JSON writer has the same problem:
{noformat}
  @Override
  public void writeTimestamp(FieldReader reader) throws IOException {
    if (reader.isSet()) {
      writeTimestamp(reader.readLocalDateTime());
    } else {
      writeTimeNull();
    }
  }
{noformat}
Basically, it takes a {{LocalDateTime}} and formats it as a UTC timestamp 
(using the "Z" suffix). This is valid only if the machine is in the UTC time 
zone, which is why the test for this class attempts to force the local time 
zone to UTC, something that most users will not do.
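
One possible correction (again a sketch under the assumptions above, not Drill's code) is to anchor the {{LocalDateTime}} in the server's zone before printing it with the "Z" suffix:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch only: a LocalDateTime must be anchored in the server's zone before
// it may legitimately be printed as a UTC ("Z"-suffixed) timestamp.
public class JsonTimestampFormat {
  static String toUtcString(LocalDateTime wallClock, ZoneId serverZone) {
    return wallClock.atZone(serverZone)          // interpret in the server zone
        .withZoneSameInstant(ZoneOffset.UTC)     // same instant, UTC clock
        .format(DateTimeFormatter.ISO_INSTANT);  // e.g. "...T20:00:00Z"
  }

  public static void main(String[] args) {
    LocalDateTime t = LocalDateTime.of(2022, 1, 15, 12, 0);
    // 12:00 in Los Angeles (UTC-8 in January) is 20:00 UTC.
    System.out.println(toUtcString(t, ZoneId.of("America/Los_Angeles")));
    // 2022-01-15T20:00:00Z
  }
}
```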

A consequence of this bug is that "round trip" CTAS will change dates by the 
UTC offset of the machine running the CTAS. In the Pacific time zone, each 
"round trip" subtracts 8 hours from the time. After three round trips, the 
"UTC" date in the Parquet file or JSON will be a day earlier than the original 
data. One might argue that this "feature" is not always helpful.
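
The drift is easy to model: each buggy round trip shifts the value by the zone's UTC offset once more. A toy illustration (assuming a fixed UTC-8 offset, i.e. Pacific standard time, ignoring DST):

```java
// Toy model of the round-trip bug: every CTAS cycle re-interprets the value
// in the wrong zone, shifting it by the zone offset again.
public class RoundTripDrift {
  public static void main(String[] args) {
    long offsetMillis = 8L * 3600 * 1000; // UTC-8, Pacific standard time
    long t = 0;                           // some original timestamp
    for (int trip = 0; trip < 3; trip++) {
      t -= offsetMillis;                  // one write/read round trip
    }
    System.out.println(t / (3600 * 1000)); // -24: a full day earlier
  }
}
```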

h3. Parquet Timestamp

Timestamp handling is quite confusing. Let's start by establishing the form of 
timestamp that Drill should write to Parquet. 

Drill uses the deprecated Parquet {{OriginalType}} to describe types, and uses 
{{TIMESTAMP_MILLIS}} to describe a timestamp. In {{ParquetTypeHelper}}:

{noformat}
    originalTypeMap.put(MinorType.TIMESTAMP, OriginalType.TIMESTAMP_MILLIS);
{noformat}

The {{OriginalType}} is converted, within Parquet, to the preferred 
{{LogicalTypeAnnotation}} class:

{noformat}
      case TIMESTAMP_MILLIS:
        return timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
{noformat}

Where the declaration is:

{noformat}
  public static TimestampLogicalTypeAnnotation timestampType(
      final boolean isAdjustedToUTC, final TimeUnit unit) {
{noformat}

This means that the data written to Parquet should be "adjusted to UTC", that 
is, in the UTC timezone. Thus, Drill must convert from the "Drill timestamp" in 
the local time zone to UTC when writing to Parquet.

On the read side, the class {{ParquetToDrillTypeConverter}} maps Parquet types 
to Drill types:

{noformat}
          case TIMESTAMP_MILLIS:
          case TIMESTAMP_MICROS:
            return TypeProtos.MinorType.TIMESTAMP;
{noformat}

The readers in {{NullableFixedByteAlignedReaders}} read these values. Here we see a bug:

{noformat}
  static class NullableDictionaryTimeStampReader
      extends NullableColumnReader<NullableTimeStampVector> {
    @Override
    protected void readField(long recordsToReadInThisPass) {
      ValuesReader valReader = usingDictionary
          ? pageReader.getDictionaryValueReader()
          : pageReader.getValueReader();
      for (int i = 0; i < recordsToReadInThisPass; i++) {
        valueVec.getMutator().setSafe(valuesReadInCurrentPass + i, valReader.readLong());
      }
      advanceWriterIndex();
    }
  }
{noformat}

Here we see that the value read from Parquet is written, _without conversion_, 
into the Drill timestamp vector. That is, when reading, we assume that the 
Parquet data is in the local time zone, completely ignoring the 
{{isAdjustedToUTC}} flag that we ourselves asked to be set. Hence, we write 
UTC, but read UTC as local time.
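
The corresponding read-side fix would be the inverse conversion, applied before the value lands in the vector. Again a hypothetical sketch, not the actual Drill code:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical helper: convert the UTC millis stored in Parquet into Drill's
// local-zone timestamp encoding.
public class UtcToLocal {
  static long utcMillisToLocal(long utcMillis, ZoneId serverZone) {
    // Find the wall-clock time of this instant in the server's zone, then
    // re-encode that wall clock as if it were UTC (Drill's convention).
    return Instant.ofEpochMilli(utcMillis)
        .atZone(serverZone)
        .toLocalDateTime()
        .toInstant(ZoneOffset.UTC)
        .toEpochMilli();
  }

  public static void main(String[] args) {
    long eightPmUtc = 20L * 3600 * 1000; // 1970-01-01T20:00:00Z
    // In Los Angeles (UTC-8) that instant reads as noon, 8 hours earlier.
    long shift = eightPmUtc
        - utcMillisToLocal(eightPmUtc, ZoneId.of("America/Los_Angeles"));
    System.out.println(shift / (3600 * 1000)); // 8
  }
}
```

The reader loop would then call {{setSafe(valuesReadInCurrentPass + i, utcMillisToLocal(valReader.readLong(), zone))}}, honoring the {{isAdjustedToUTC}} annotation.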


> Parquet record writer does not convert Drill local timestamp to UTC
> ------------------------------------------------------------------
>
>                 Key: DRILL-8099
>                 URL: https://issues.apache.org/jira/browse/DRILL-8099
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.19.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
