Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 13: (15 comments) I spent a bit of time trying to break the code by injecting bugs and it looks like the test coverage is pretty good - there were a couple of things it didn't catch though that might be worth investigating a little. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195 PS2, Line 1195: // Populate RowGroup::sorting_columns with all columns specified by the Frontend. > AFAIK we always write only one row group. I just copy-pasted this part from I didn't follow up on this. It looks like the code only ever writes one row group but it's somewhat generalized to handle multiple row groups. I think people might have used this in the past to generate custom files. If we ever wrote out multiple row groups, we'd have to test it properly. In the meantime I think we could remove this if and just use current_row_group_ below instead of hardcoding [0] and it would all work ok. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@107 PS13, Line 107: page_index_mem_tracker_(-1, "Parquet Page Index", We can have many HdfsTableWriters per HdfsTableSink if we're doing non-cluster inserts. As a rule of thumb, I think we should have O(1) MemTrackers per ExecNode or DataSink - O(n) or greater could cause problems. Otherwise the MemTracker dumps could get huge and cause problems if we want to serialize them into a string. We could have one "Parquet Page Indices" MemTracker per HdfsTableSink or we could just track directly against the HdfsTableSink MemTracker. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1212 PS13, Line 1212: if (file_metadata_.row_groups.size() != 1) return Status::OK(); This http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1214 PS13, Line 1214: parquet::RowGroup* row_group = &(file_metadata_.row_groups[0]); Can't we use current_row_group_ here? http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1216 PS13, Line 1216: for (int i = 0; i < columns_.size(); i++) { I injected an error here by forgetting to write out one of the pages and the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1241 PS13, Line 1241: row_group->columns[i].__set_offset_index_offset(file_pos_); I added some errors to the offsets and lengths - looks like the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@31 PS13, Line 31: ascending_boundary_order_ = true; I tried injecting a bug by initializing these to "false". The tests didn't catch it. Would be good to add the coverage but this seems less bad of a coverage gap than the opposite (claiming an order that is not valid). http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@56 PS13, Line 56: false I tried injecting a bug by flipping the value of this and the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@60 PS13, Line 60: if (MinMaxTrait<T>::Compare(prev_page_max_value_, cs->max_value_) < 0 || I tried injecting a bug by removing the first condition and the tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@68 PS13, Line 68: max_value_ I tried switching this to cs->min_value. The tests caught it. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@69 PS13, Line 69: prev_page_min_buffer_.Clear(); I tried removing this line. The tests didn't catch it, even under ASAN. http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc@29 PS13, Line 29: int nit: use static_cast<> http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql File testdata/bin/load-dependent-tables.sql: http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@77 PS13, Line 77: WITH DEFERRED REBUILD IN TABLE hive_index_tbl missing trailing semicolon here and on some of the "DROP TABLE" statements below -I tried to run this file locally and it failed. http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@94 PS13, Line 94: '\\\\' I had to replace this with \\ to get this to work locally. http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@106 PS13, Line 106: "}}, I had to replace }} with } and {{ with { to get this to run locally. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 25 Apr 2018 23:26:56 +0000 Gerrit-HasComments: Yes
