Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................
Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 39: /// TIMESTAMP values are written in the in-memory format used by Impala, relative to UTC,
> It's not that Hive and parquet-mr do it differently, it's simply that there

I agree there's no logical timestamp type, but the physical type is still an INT96, not a generic binary type. I see that parquet-mr internally uses a byte array to represent INT96, but that's an implementation artifact of parquet-mr.

My reasons for thinking this is a bug:
* INT96 should be ordered in the same way as INT64 and INT32
* ordering INT96 by little-endian byte order is minimally useful for min-max pruning.

It seems like this code that creates a BinaryStatistics object for an INT96 is the culprit:
https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/column/statistics/Statistics.java#L61

--
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes
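To illustrate the point about little-endian byte order, here is a minimal sketch. It is not taken from Impala or parquet-mr. The names `Int96Bytes`, `EncodeLittleEndian`, `CompareAsBinary` and `CompareAsInteger` are hypothetical. It assumes the value is treated as unsigned with the upper four bytes zero, and that a generic binary comparator does an unsigned lexicographic byte comparison (modelled here with `memcmp`); the actual parquet-mr comparator may differ in detail.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 12-byte buffer standing in for an INT96 physical value.
using Int96Bytes = std::array<uint8_t, 12>;

// Stores v least-significant byte first (little-endian); bytes 8..11 stay zero.
Int96Bytes EncodeLittleEndian(uint64_t v) {
  Int96Bytes out{};
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(v >> (8 * i));
  }
  return out;
}

// Assumption: a generic binary type is ordered byte by byte from index 0.
int CompareAsBinary(const Int96Bytes& a, const Int96Bytes& b) {
  return std::memcmp(a.data(), b.data(), a.size());
}

// Integer ordering: start at the most significant byte, which is the last one
// in a little-endian layout.
int CompareAsInteger(const Int96Bytes& a, const Int96Bytes& b) {
  for (int i = 11; i >= 0; --i) {
    if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Small self-check: 1 < 256 as integers, but the byte-wise order is reversed
// because 1 encodes as 01 00 ... and 256 encodes as 00 01 ...
void CheckExample() {
  const Int96Bytes one = EncodeLittleEndian(1);
  const Int96Bytes big = EncodeLittleEndian(256);
  assert(CompareAsInteger(one, big) < 0);
  assert(CompareAsBinary(one, big) > 0);
}
```

Under these assumptions, a min/max pair computed with the byte-wise comparator does not bound the values in integer order, so a reader cannot safely use it to skip row groups for a range predicate.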
