Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11984 )
Change subject: IMPALA-7853: Add support to read int64 NANO timestamps from Parquet ...................................................................... Patch Set 2: (11 comments) http://gerrit.cloudera.org:8080/#/c/11984/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11984/2//COMMIT_MSG@12 PS2, Line 12: 1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807 UTC > Maybe you could mention that it stores the number of nanoseconds since the Done http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h@479 PS2, Line 479: LimitedRange > Dows this "LimitedRange" part of the name add extra value? Am I right that There is already a function with nanosecond precision, and it can create the full range of timestamps: FromUnixTimeNanos(time_t unix_time, int64_t nanos, const Timezone& local_tz); That function could have been used here (after splitting nanoseconds to seconds and subseconds), but I think that it would slower than the specialized solution. http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h@507 PS2, Line 507: /// conversion is necessary. Returns true if the schema describes a valid timestamp > Could you please comment what the parameters are for, pls? Done http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.h@509 PS2, Line 509: ProcessSchema > I feel that something like GetTimestampInfoFromSchema might be a bit more s Done http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc File be/src/exec/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@102 PS2, Line 102: // Timestamps can be only encoded as INT64 or INT96. > nit: You might want to adjust this comment to cover the case when this func I am not sure if I have done what you meant. This function should be only called for TIMESTAMP SQL columns, and should return false if the Parquet column is not timestamp. http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@106 PS2, Line 106: Newer solution, > nit: No need for this part. In addition, it's not guaranteed that reading t Some explanation about "new": LogicalType is often called "new logical type", as converted type was also some kind of logical type. I agree that in 2 years this will not be meaningful. http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@122 PS2, Line 122: Older solution, > No need for "Older solution" part of the comment. Done http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@123 PS2, Line 123: are/were never > nit: I think we shouldn't refer to what happened before this change as we s This code has to handle files written by older versions of Impala/some other tools we try to be compatible with, so I think that referring to he past makes sense here. Not specifying whether utc->local conversion is necessary was a gap in Parquet specification and was interpreted by different software differently. This will not be an issue with the new logical type, but we have to make assumptions with older files. http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-common.cc@148 PS2, Line 148: if (e.type == parquet::Type::INT96 && convert_int96_timestamps) needs_conversion = true; > Shouldn't this line be also included in ProcessSchema? Since you introduced I thought that doing it this way is the lesser evil - convert_int96_timestamps is not readily available in validation logic (it depends on a flag and the writer of the file). http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/11984/2/be/src/exec/parquet-metadata-utils.cc@71 PS2, Line 71: return ParquetTimestampDecoder::ProcessSchema(element, precision, needs_conversion); > Just thinking out loud here: This ProcessSchema() is now used for 1) checki Before this change the two functions were separate, but with the addition of logical types the code duplication became too much for me and I tried to merge the bulk of the two functions in ProcessSchema. I think that the ideal way to do this would be to do the whole processing only once per column chunk (currently it is validated once, then processed again for stat filtering if needed, then once again to initialize the column reader). Changing this to use the same processed metadata would be a non-trivial refactor, as most of the code assumes everything needed for decoding can be simply deduced from the physical + SQL types, so handling of more complex types is a bit hacky. http://gerrit.cloudera.org:8080/#/c/11984/2/fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java File fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java: http://gerrit.cloudera.org:8080/#/c/11984/2/fe/src/main/java/org/apache/impala/analysis/ParquetHelper.java@248 PS2, Line 248: LogicalTypeAnnotation logicalType = parquetType.getLogicalTypeAnnotation(); > What happens when there is no "logical type" in a Parquet file, but only "c The Parquet API we use in Java does fill the logical type if only the converted type is filled - every converted type can be expressed as logical type, but not vice versa. -- To view, visit http://gerrit.cloudera.org:8080/11984 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I932396d8646f43c0b9ca4a6359f164c4d8349d8f Gerrit-Change-Number: 11984 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 27 Nov 2018 16:30:39 +0000 Gerrit-HasComments: Yes
