Zoltan Ivanfi has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................

Patch Set 2:

> > (1 comment)
>
> Apologies for the delayed reply. Hive writes timestamps using 12
> bytes using little endian. Then it passes them to parquet-mr as a
> BINARY string, which means it is hitting PARQUET-251. This explains
> why I saw the odd values for min/max in my tests.
>
> Internally parquet-mr orders BINARY values using byte comparison,
> potentially leading to a min/max value not being the semantically
> smallest/largest value of a set of values. I am inclined to call
> this a bug in hive, but I'm curious to hear what you think about
> this.

I don't think it's a bug that the min/max corresponds to the binary ordering, since at Parquet's level timestamps are just meaningless bytes. If we were using a proper Parquet logical type then it would be different, but when saving 12 bytes, I think the proper order is the binary ordering. In any case, I think we should aim for Hive compatibility here.

The bug that causes the last row to be both the min and the max value is a major pain, though, because it makes column statistics for byte arrays totally useless. I don't see how we could handle that other than by ignoring any such min/max values written by affected Hive versions.

--
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: No
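
For reference, the difference between binary ordering and semantic ordering of 12-byte little-endian timestamps can be illustrated with a minimal Python sketch. It is not code from the patch or from parquet-mr. The byte layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little endian) and the helper names `encode_ts` and `decode_ts` are assumptions made for the illustration, and Python's unsigned lexicographic `bytes` comparison merely stands in for "byte comparison".

```python
import struct

def encode_ts(julian_day: int, nanos_of_day: int) -> bytes:
    # Assumed layout: 8-byte little-endian nanoseconds of day,
    # then 4-byte little-endian Julian day number (12 bytes total).
    return struct.pack("<qi", nanos_of_day, julian_day)

def decode_ts(raw: bytes) -> tuple:
    # Returns (julian_day, nanos_of_day), which sorts chronologically.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)

values = [
    encode_ts(2457754, 1),    # earlier day, 1 ns after midnight
    encode_ts(2457755, 0),    # later day, exactly midnight
    encode_ts(2457754, 256),  # earlier day, 256 ns after midnight
]

# Binary ordering: lexicographic comparison of the raw bytes.
byte_min, byte_max = min(values), max(values)

# Semantic ordering: compare the decoded (day, nanos) pairs.
sem_min = min(values, key=decode_ts)
sem_max = max(values, key=decode_ts)

print("byte min/max:    ", decode_ts(byte_min), decode_ts(byte_max))
print("semantic min/max:", decode_ts(sem_min), decode_ts(sem_max))
```

With these three values the byte-wise minimum is the chronologically latest timestamp and the byte-wise maximum is the earliest, because the least significant byte of the nanosecond field is compared first. A reader that prunes row groups with such statistics has to apply the same binary ordering the writer used, or ignore the statistics.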
