Zoltan Ivanfi has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

> > (1 comment)
 > 
 > Apologies for the delayed reply. Hive writes timestamps as 12
 > bytes in little-endian order. It then passes them to parquet-mr as a
 > BINARY string, which means it hits PARQUET-251. This explains
 > why I saw the odd values for min/max in my tests.
 > 
 > Internally parquet-mr orders BINARY values using byte comparison,
 > potentially leading to a min/max value not being the semantically
 > smallest/largest value of a set of values. I am inclined to call
 > this a bug in Hive, but I'm curious to hear what you think about
 > this.

I don't think it's a bug that the min/max corresponds to the binary ordering,
since at Parquet's level the timestamps are just opaque bytes. If we were
using a proper Parquet logical type it would be different, but when saving
12 raw bytes, I think the proper order is the binary ordering. In any case, I
think we should aim for Hive compatibility here.
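
To illustrate how binary ordering and semantic ordering can disagree, here is
a minimal Python sketch. It assumes the 12-byte layout is an 8-byte
little-endian nanoseconds-of-day field followed by a 4-byte little-endian
Julian day, and that BINARY values are compared lexicographically byte by
byte. The helper names and the sample values are made up for illustration.

```python
import struct

def encode_int96(julian_day, nanos_of_day):
    # Assumed layout: 8-byte little-endian nanoseconds of day,
    # followed by a 4-byte little-endian Julian day (12 bytes total).
    return struct.pack("<qi", nanos_of_day, julian_day)

def decode_int96(raw):
    # Return (julian_day, nanos_of_day) so tuples sort chronologically.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)

# Three illustrative timestamps as (julian_day, nanos_of_day).
values = [(2457754, 2), (2457754, 256), (2457755, 1)]
encoded = [encode_int96(day, nanos) for day, nanos in values]

# Min/max under plain byte comparison, as a BINARY column would see them.
byte_min, byte_max = min(encoded), max(encoded)

# Min/max under the semantic (chronological) ordering.
semantic_min, semantic_max = min(values), max(values)

print("byte order min/max:    ", decode_int96(byte_min), decode_int96(byte_max))
print("semantic order min/max:", semantic_min, semantic_max)
```

Because the least significant byte comes first, the byte-wise maximum in this
sample is actually the chronologically earliest timestamp.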

The bug that causes the last row to be both the min and the max value is a
major pain, though: it makes column statistics for byte arrays totally useless.
I don't see how we could handle that other than by ignoring any such min/max
values written by affected Hive versions.
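
A rough sketch of what such a check could look like on the reader side,
written in Python for brevity. The function name, the set of affected
physical types, the "parquet-mr version X.Y.Z" format of the created_by
string, and the cutoff version are all assumptions for illustration; the
actual first fixed release would have to be confirmed against PARQUET-251.

```python
import re

# Assumption: column types whose min/max may be affected by the bug.
AFFECTED_TYPES = {"BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY", "INT96"}

# Assumption: placeholder for the first writer release without the bug.
FIRST_FIXED_VERSION = (1, 8, 0)

def should_ignore_stats(created_by, physical_type,
                        first_fixed=FIRST_FIXED_VERSION):
    """Return True if min/max stats for this column should not be trusted."""
    if physical_type not in AFFECTED_TYPES:
        return False
    if not created_by:
        # Unknown writer: be conservative and ignore the statistics.
        return True
    if not created_by.startswith("parquet-mr"):
        # Written by something other than parquet-mr (assumed unaffected).
        return False
    match = re.match(r"parquet-mr version (\d+)\.(\d+)\.(\d+)", created_by)
    if match is None:
        # parquet-mr, but the version cannot be parsed: ignore to be safe.
        return True
    version = tuple(int(part) for part in match.groups())
    return version < first_fixed
```

With this approach, files from affected writers simply lose min/max-based
filtering for those columns, while all other columns and writers keep it.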

-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: No
