Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................

Patch Set 9:

(8 comments)

Thanks for the review. Please see my inline comments and PS11.

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 339: int64_t encoded_value_size_;
> sounds good
Done

http://gerrit.cloudera.org:8080/#/c/5611/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 139: }
> why not just pass in metadata->statistics and then set the __isset flag by
Done

Line 652: // Add the size of the data page header
> avoid copy by passing in header.data_page_header.statistics
Done

http://gerrit.cloudera.org:8080/#/c/5611/10/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 103: /// Maximum statistics size. If the combined size of the min and max values of
> qualify as 'parquet.Statistics' so it's clearer
Done. I used :: since that is the class name in its namespace. Do you prefer "."?

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 127: // statistics behavior from any implicit behavior of the types?
> i understand how that may not be the case today, but in order for them to b
If Parquet's and Impala's ordering were "roughly the same", then we would need some translation between our min values and the ones in Parquet. For our current types I don't see that as a problem either, but I think Tim was concerned about types being added in the future and wanted to prevent potential bugs. I'll let Tim add his thoughts to the discussion; personally, I'm good with using min/max for now.

The comment was there to facilitate this discussion, since it came up in reviews of previous patch sets. I will remove it.
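
To make the ordering question on Line 127 concrete, here is a minimal sketch of what decoupling the statistics ordering from a type's implicit operator< could look like. It is not the code in this patch: MinMaxTracker and its Less parameter are hypothetical names chosen for illustration, and the default simply falls back to std::less.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch, not Impala's ColumnStats: tracks min/max using an
// explicit comparator, so the ordering used for statistics is a named choice
// rather than whatever operator< the value type happens to provide.
template <typename T, typename Less = std::less<T>>
class MinMaxTracker {
 public:
  explicit MinMaxTracker(Less less = Less()) : less_(less) {}

  // Folds 'v' into the running min and max.
  void Update(const T& v) {
    if (!has_values_) {
      min_ = v;
      max_ = v;
      has_values_ = true;
      return;
    }
    if (less_(v, min_)) min_ = v;
    if (less_(max_, v)) max_ = v;
  }

  // False until the first call to Update(); min()/max() are only meaningful
  // once this returns true.
  bool has_values() const { return has_values_; }
  const T& min() const { return min_; }
  const T& max() const { return max_; }

 private:
  Less less_;
  bool has_values_ = false;
  T min_{};
  T max_{};
};
```

With the default comparator this behaves like plain min/max. A type added later whose Parquet ordering differs from its in-memory ordering would supply its own Less instead of relying on operator<.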

http://gerrit.cloudera.org:8080/#/c/5611/10/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 84: 
> remove
Without these, clang-format will undo all manual style changes on the lines modified by this change. I added a TODO to the commit message to remove them once the change has a +2, when I will have to rebase it anyway.

Line 87: class ColumnStats : public ColumnStatsBase {
> indent the subsequent lines belonging to the logical expr two more spaces (
Done

Line 157: /// Returns the number of bytes needed to encode value 'v'.
> this is very verbose. why needed?
See my previous comment.

-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Michael Brown <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes
