Tim Armstrong has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer ......................................................................
Patch Set 6: (13 comments) Thanks , I think this will be easier to move forward. I think to make progress on the other types we'll have to get buy-in in the upstream parquet project. I wonder if that may be tough for some data types - e.g. we'd ideally want to sort decimal by the logical value, but the min/max stats design doesn't really seem to be designed with logical types in mind. http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: PS6, Line 136: < <=? PS6, Line 178: ProcessValue Maybe AppendValue() or something like that? Process is a little vague. Line 389: bool ret = bool_values_->PutValue(v, 1); Consider something like: if (!bool_values_->PutValue(v, 1)) return false; PS6, Line 644: < <=? PS6, Line 648: column row group http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: Line 105: static const int MAX_COLUMN_STATS_SIZE = 4 * 1024; Mention that this value came from parquet-mr? http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: PS6, Line 125: min Others might have different opinions but I think I'd prefer making this even more explicit and avoid relying on Impala's built-in operator overloads, just because the semantics are subtle and it's easy for someone reading this code to gloss over the fact that we can't always use the "obvious" ordering. The comments do explain it, but people don't always read the comments. Maybe we could have a set of functions like: ParquetStatsLessThan(int32_t, int32_t), etc. Might be worth getting someone else's opinion before making changes. Line 139: out->__set_min(min_str); It probably doesn't really matter, but I think using move() here would a copy of min_str and max_str PS6, Line 150: ((uint8_t*) reinterpret_cast Line 183: /// parquet-mr and subsequently Hive currently do not handle the following types We should file an Impala JIRA to keep track of this too http://gerrit.cloudera.org:8080/#/c/5611/6/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 208: class TestHdfsParquetTableStatsWriter(ImpalaTestSuite): It would be good to have some kind of test for multiple row groups so that code is exercised. Otherwise I can imagine it'd be easiest for a bug around resetting state between row groups to creep in. Could we do some kind of validation like writing out the data in sorted order, then validating that the min-max ranges don't overlap? Line 309: (0, 7299), It would be nice to test a table with a negative number to make sure that those are handled correctly. http://gerrit.cloudera.org:8080/#/c/5611/6/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: Line 111: return None This return isn't needed -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
