Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................

Patch Set 1:

> That is really unfortunate that our timestamps are treated as byte
> arrays by parquet-mr - it makes the min/max stats mostly useless
> for pruning files. I feel like this is a bug in parquet-mr, since
> INT96 is in the spec
> (https://github.com/apache/parquet-format/blob/98c5e2b8575a809b09d996080428be730614d374/Encodings.md)
> and it's being treated inconsistently with int32/int64. Common
> sense would dictate that min/max of int96 should be treated the
> same as int32/int64. Seems like something we should open an issue
> against Parquet for? And Hive? Otherwise our timestamp stats will
> be pretty useless. In any case we should clarify this before
> writing out our own incompatible stats.

I agree. In fact, this may be two separate bugs:

1) parquet-mr uses Binary internally to store INT96, and so uses BinaryStatistics for those values (https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/column/statistics/Statistics.java#L61).

2) Hive also hands Timestamps over to parquet-mr as BINARY instead of using INT96.

Currently the second issue makes no practical difference, because both paths end up with binary statistics. Once statistics support for INT96 is fixed in parquet-mr, however, Hive will need to catch up.

@Zoltan, should I go ahead and open one issue against each project to sort this out?

-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+ger...@cloudera.com>
Gerrit-HasComments: No
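
For readers following the thread, here is a minimal sketch of why min/max statistics computed over the raw bytes of INT96 timestamps are of little use for pruning. It is an illustration only, not code from Impala or parquet-mr. It assumes the commonly used INT96 timestamp layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian) and uses Python's unsigned byte-wise comparison as a stand-in for the Binary ordering, which may differ in detail. The function names and the Julian day values are hypothetical.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes.

    Assumed layout: 8 bytes nanoseconds within the day, then
    4 bytes Julian day number, both little-endian.
    """
    return struct.pack("<qi", nanos_of_day, julian_day)


def chronological_key(raw: bytes):
    """Decode the 12 bytes into a key that sorts in time order."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


def min_max(values, key=None):
    """Return (min, max) of values under the given ordering."""
    return min(values, key=key), max(values, key=key)


# Three timestamps on consecutive days (illustrative Julian day numbers).
values = [
    encode_int96(2457755, 3),  # earliest: first day, 3 ns after midnight
    encode_int96(2457756, 2),
    encode_int96(2457757, 1),  # latest: third day, 1 ns after midnight
]

# Byte-wise ordering compares the time-of-day bytes first.
bytewise_min, bytewise_max = min_max(values)

# Chronological ordering compares the day first, then the time of day.
logical_min, logical_max = min_max(values, key=chronological_key)

print("byte-wise min/max:", chronological_key(bytewise_min),
      chronological_key(bytewise_max))
print("logical   min/max:", chronological_key(logical_min),
      chronological_key(logical_max))
```

In this sketch the byte-wise minimum is the latest timestamp and the byte-wise maximum is the earliest, so a reader that pruned files using those statistics as a time range would get the wrong answer.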