Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(1 comment)

It seems like we really need to talk through the big picture of what to do 
here, since there are so many unresolved problems in the parquet spec. 

It seems like min/max for int, float and bool is straightforward but all other 
cases are problematic.

http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS2, Line 35: DECIMAL
> I added to the comment to clarify on this.
No I don't think we need to support writing that encoding for decimals. Henry 
had a patch for reading that decimal encoding, so we may need to tread 
carefully when implementing the read path though.

That all said, I think the Parquet/parquet-mr's decimal min/max stats are 
broken. It looks like they compute min/max for decimal based on their default 
binary ordering. This has two problems:

* Ordering by unsigned bytes puts negative numbers ahead of positive numbers 
for twos-complement integers. We really need to . Maybe this is tolerable, 
although it probably makes it less effective (e.g. min: 0, max: -1 covers all 
possible values)
* Parquet-mr does something even stranger by treating the bytes as signed: 
https://issues.apache.org/jira/browse/PARQUET-686

I think essentially parquet-mr's stats are broken in all cases where 
BinaryStatistics is used here: 
https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/column/statistics/Statistics.java#L64


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to