[Impala-ASF-CR] IMPALA-7853: Add support to read int64 NANO timestamps from Parquet

Csaba Ringhofer (Code Review) Tue, 27 Nov 2018 08:32:01 -0800

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11984 )


Change subject: IMPALA-7853: Add support to read int64 NANO timestamps from 
Parquet
......................................................................


Patch Set 2:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/11984/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11984/2//COMMIT_MSG@12
PS2, Line 12: 1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807 UTC
> Maybe you could mention that it stores the number of nanoseconds since the
Done


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h@479
PS2, Line 479: LimitedRange
> Dows this "LimitedRange" part of the name add extra value? Am I right that
There is already a function with nanosecond precision, and it can create the 
full range of timestamps:
FromUnixTimeNanos(time_t unix_time, int64_t nanos,
      const Timezone& local_tz);

That function could have been used here (after splitting nanoseconds to seconds 
and subseconds), but I think that it would slower than the specialized solution.


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h@507
PS2, Line 507:   /// conversion is necessary. Returns true if the schema 
describes a valid timestamp
> Could you please comment what the parameters are for, pls?
Done


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h@509
PS2, Line 509: ProcessSchema
> I feel that something like GetTimestampInfoFromSchema might be a bit more s
Done


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc
File be/src/exec/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@102
PS2, Line 102:   // Timestamps can be only encoded as INT64 or INT96.
> nit: You might want to adjust this comment to cover the case when this func
I am not sure if I have done what you meant. This function should be only 
called for TIMESTAMP SQL columns, and should return false if the Parquet column 
is not timestamp.


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@106
PS2, Line 106: Newer solution,
> nit: No need for this part. In addition, it's not guaranteed that reading t
Some explanation about "new": LogicalType is often called "new logical type", 
as converted type was also some kind of logical type. I agree that in 2 years 
this will not be meaningful.


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@122
PS2, Line 122: Older solution,
> No need for "Older solution" part of the comment.
Done


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@123
PS2, Line 123: are/were never
> nit: I think we shouldn't refer to what happened before this change as we s
This code has to handle files written by older versions of Impala/some other 
tools we try to be compatible with, so I think that referring to he past makes 
sense here.

Not specifying whether utc->local conversion is necessary was a gap in Parquet 
specification and was interpreted by different software differently. This will 
not be an issue with the new logical type, but we have to make assumptions with 
older files.


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@148
PS2, Line 148:   if (e.type == parquet::Type::INT96 && 
convert_int96_timestamps) needs_conversion = true;
> Shouldn't this line be also included in ProcessSchema? Since you introduced
I thought that doing it this way is the lesser evil - convert_int96_timestamps 
is not readily available in validation logic (it depends on a flag and the 
writer of the file).


http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-metadata-utils.cc@71
PS2, Line 71:   return ParquetTimestampDecoder::ProcessSchema(element, 
precision, needs_conversion);
> Just thinking out loud here: This ProcessSchema() is now used for 1) checki
Before this change the two functions were separate, but with the addition of 
logical types the code duplication became too much for me and I tried to merge 
the bulk of the two functions in ProcessSchema.

I think that the ideal way to do this would be to do the whole processing only 
once per column chunk (currently it is validated once, then processed again for 
stat filtering if needed, then once again to initialize the column reader). 
Changing this to use the same processed metadata would be a non-trivial 
refactor, as most of the code assumes everything needed for decoding can be 
simply deduced from the physical + SQL types, so handling of more complex types 
is a bit hacky.


http://gerrit.cloudera.org:8080/#/c/11984/2/fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java
File fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java:

http://gerrit.cloudera.org:8080/#/c/11984/2/fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java@248
PS2, Line 248:     LogicalTypeAnnotation logicalType = 
parquetType.getLogicalTypeAnnotation();
> What happens when there is no "logical type" in a Parquet file, but only "c
The Parquet API we use in Java does fill the logical type if only the converted 
type is filled - every converted type can be expressed as logical type, but not 
vice versa.



--
To view, visit http://gerrit.cloudera.org:8080/11984
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I932396d8646f43c0b9ca4a6359f164c4d8349d8f
Gerrit-Change-Number: 11984
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 27 Nov 2018 16:30:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7853: Add support to read int64 NANO timestamps from Parquet

Reply via email to