Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 1:

> > That is really unfortunate that our timestamps are treated as byte
 > > arrays by parquet-mr - it makes the min/max stats mostly useless
 > > for pruning files. I feel like this is a bug in parquet-mr, since
 > > INT96 is in the spec 
 > > (https://github.com/apache/parquet-format/blob/98c5e2b8575a809b09d996080428be730614d374/Encodings.md)
 > > and it's being treated inconsistently with int32/int64. Common
 > > sense would dictate that min/max of int96 should be treated the
 > > same as int32/int64. Seems like something we should open an issue
 > > against Parquet for? And Hive? Otherwise our timestamp stats will
 > > be pretty useless. In any case we should clarify this before
 > > writing out our own incompatible stats.
 > 
 > I agree, in fact this may actually be two separate bugs.
 > 
 > 1) parquet-mr uses Binary internally to store INT96, and will use
 > BinaryStatistics for those values 
 > (https://github.com/Parquet/parquet-mr/blob/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-column/src/main/java/parquet/column/statistics/Statistics.java#L61).
 > 2) Hive hands Timestamps over to parquet-mr as BINARY, too, instead
 > of using INT96. Currently this makes no difference, but once
 > statistics support for INT96 is fixed in parquet-mr, Hive
 > would need to catch up.
 > 
 > @Zoltan, should I go ahead and open one issue with each of them to
 > sort this out?

Our read path will need logic to detect the corrupt statistics written by
parquet-mr 1.5 so that we can filter them out. In the same code path we could
also filter out all timestamp statistics written by Hive until the ordering
gets fixed.
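
To make the read-path idea concrete, here is a rough sketch of such a filter,
not the actual Impala code. The function name should_ignore_stats and the regex
are hypothetical. It assumes the writer can be identified from the file's
created_by string in the usual "<app> version <x.y.z> ..." form, that Hive
writes through parquet-mr, and that statistics from an unrecognized writer
should be ignored to be safe.

```python
import re

# Assumed shape of the created_by string: "<app> version <major>.<minor>..."
_CREATED_BY = re.compile(r"^(?P<app>\S+) version (?P<major>\d+)\.(?P<minor>\d+)")


def should_ignore_stats(created_by: str, is_timestamp: bool) -> bool:
    """Hypothetical helper: True if a column's min/max stats should be skipped."""
    m = _CREATED_BY.match(created_by or "")
    if m is None:
        # Unknown writer: assume the statistics cannot be trusted.
        return True
    if m["app"] != "parquet-mr":
        # Assumption: only parquet-mr based writers are affected.
        return False
    if (int(m["major"]), int(m["minor"])) == (1, 5):
        # Corrupt statistics written by parquet-mr 1.5: drop everything.
        return True
    # Other parquet-mr versions: drop only timestamp statistics until the
    # ordering is fixed.
    return is_timestamp
```

With this shape, both rules live in one place, and the timestamp rule can be
relaxed later by adding a version check once a fixed writer exists.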

However, numerically ordered statistics written by us would be incompatible
with Hive. Older versions of Hive will be unable to detect that the semantics
have changed from little-endian binary ordering to numeric ordering, so I
currently don't see an alternative to encoding them as 12-byte little-endian
binaries and ordering them bytewise.
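
A small sketch of why the two orderings disagree. It assumes the INT96
timestamp layout is 8 bytes of nanoseconds-of-day followed by 4 bytes of Julian
day, both little-endian. The helper name encode_int96_timestamp and the sample
values are made up for illustration.

```python
import struct


def encode_int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes (assumed layout: nanos first, then day)."""
    return struct.pack("<qi", nanos_of_day, julian_day)


# (julian_day, nanos_of_day) tuples; tuple order is chronological order.
timestamps = [(2457754, 0), (2457753, 1), (2457753, 256)]

# Numeric (chronological) ordering.
numeric_order = sorted(timestamps)

# Bytewise ordering of the 12-byte encodings; Python compares bytes objects
# as unsigned bytes, left to right.
bytewise_order = sorted(timestamps, key=lambda t: encode_int96_timestamp(*t))

numeric_min = numeric_order[0]
bytewise_min = bytewise_order[0]
```

Here bytewise_min is the latest timestamp of the three, because the least
significant byte of the nanoseconds field is compared first. A reader that
assumes numeric semantics for such a min/max pair would prune incorrectly.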

-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+ger...@cloudera.com>
Gerrit-HasComments: No