Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer ......................................................................
Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 39: /// TIMESTAMP values are written in the in-memory format used by Impala, relative to UTC, > That's concerning if Hive and parquet-mr are inconsistent in how they defin It's not that Hive and parquet-mr do it differently, it's simply that there is no such type as our timestamp in Parquet (there is a separate timestamp_millis type), so parquet-mr doesn't handle this type at all, while Hive does. From Parquet's perspective it's just 96 bits without meaning, so probably the min and max should be calculated accordingly as well and not based on the base type. On the other hand, if Hive already has some implementation, we should probably use the same logic. But if I remember correctly, Lars looked at Hive's min/max values for timestamps and found that they were not the min/max values in any way, neither as instants on a timeline, neither as the 96 bits in any byte order. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Ivanfi <zi+ger...@cloudera.com> Gerrit-HasComments: Yes