Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................

Patch Set 13:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195
PS2, Line 1195: // Populate RowGroup::sorting_columns with all columns specified by the Frontend.
> I didn't follow up on this. It looks like the code only ever writes one row
Unfortunately it wouldn't work. Based on the description of the Parquet page indexes (https://github.com/apache/parquet-format/blob/master/PageIndex.md), we have to write the page indexes of all the row groups together, just before the footer. In other words, we have to flush all the row groups first, and only then can we write all the page indexes. Currently Impala writes only one row group per Parquet file, so I just added a DCHECK here. If Impala's behavior changes in this regard in the future, we'll need to update the write path of the page indexes.

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@107
PS13, Line 107: page_index_mem_tracker_(-1, "Parquet Page Index",
> We can have many HdfsTableWriters per HdfsTableSink if we're doing non-clus
I now track the memory directly against HdfsTableSink.

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1212
PS13, Line 1212: if (file_metadata_.row_groups.size() != 1) return Status::OK();
> This
I answered it at PS2.

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1214
PS13, Line 1214: parquet::RowGroup* row_group = &(file_metadata_.row_groups[0]);
> Can't we use current_row_group_ here?
We can't; the detailed answer is at PS2.
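To illustrate the layout constraint from the PS2 answer above, here is a minimal sketch of the required write order. This is plain Python with hypothetical helper names, not Impala's actual C++ write path: all data pages of every row group are flushed first, then every ColumnIndex/OffsetIndex is serialized contiguously (which is why their file offsets are only known after all row groups are flushed), and the footer comes last.

```python
# Sketch (hypothetical helpers) of the layout mandated by PageIndex.md:
# data pages of ALL row groups first, then all page indexes together,
# then the footer. The indexes can therefore only be written after
# every row group has been flushed.

def write_parquet_layout(row_groups):
    """row_groups: list of row groups, each a list of per-column page bytes."""
    out = bytearray(b"PAR1")          # leading magic
    index_meta = []                   # (row_group, column, index_offset)

    # 1. Flush the data pages of every row group.
    for rg in row_groups:
        for col_pages in rg:
            out.extend(col_pages)

    # 2. Only now can the page indexes be written, all together,
    #    just before the footer; their offsets go into the metadata.
    for rg_idx, rg in enumerate(row_groups):
        for col_idx, _ in enumerate(rg):
            offset = len(out)         # known only after step 1
            out.extend(b"IDX")        # stand-in for ColumnIndex + OffsetIndex
            index_meta.append((rg_idx, col_idx, offset))

    # 3. Footer (thrift-serialized FileMetaData in the real format).
    out.extend(b"FOOTER")
    out.extend(b"PAR1")               # trailing magic
    return bytes(out), index_meta
```

Since Impala currently writes one row group per file, the DCHECK in the patch corresponds to asserting `len(row_groups) == 1` in this sketch.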
http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1216
PS13, Line 1216: for (int i = 0; i < columns_.size(); i++) {
> I injected an error here by forgetting to write out one of the pages and th
Thanks for checking.

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1241
PS13, Line 1241: row_group->columns[i].__set_offset_index_offset(file_pos_);
> I added some errors to the offsets and lengths - looks like the tests caugh
Thanks!

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@31
PS13, Line 31: ascending_boundary_order_ = true;
> I tried injecting a bug by initializing these to "false". The tests didn't
Currently we don't invoke Reset() on a ColumnStatsBase object anywhere it would matter. HdfsParquetTableWriter::BaseColumnWriter has two ColumnStatsBase members: page_stats_base_ and row_group_stats_base_. For each new page we invoke Reset() on page_stats_base_, but that doesn't corrupt the ordering, because we only consider the state of row_group_stats_base_ for that, and we never call Reset() on it. Even in BaseColumnWriter::Reset() we throw away the old row_group_stats_base_ and create a brand new one.

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@56
PS13, Line 56: false
> I tried injecting a bug by flipping the value of this and the tests caught
Thanks!

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@60
PS13, Line 60: if (MinMaxTrait<T>::Compare(prev_page_max_value_, cs->max_value_) < 0 ||
> I tried injecting a bug by removing the first condition and the tests caugh
Thanks!

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@68
PS13, Line 68: max_value_
> I tried switching this to cs->min_value.
The tests caught it. Thanks!

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@69
PS13, Line 69: prev_page_min_buffer_.Clear();
> I tried removing this line. The tests didn't catch it, even under ASAN.
It turned out that the _validate_ordering function in test_parquet_page_index.py was too loose: we could remove both prev_page_min_buffer_ and prev_page_max_buffer_ without it noticing. Now _validate_ordering implements the same logic as parquet-column-stats to determine the expected ordering.

Interestingly, the tests only catch the case where we remove prev_page_max_buffer_.Clear(); the lack of prev_page_min_buffer_.Clear() remained unnoticed. I tried running the tests on a couple of existing tables, but without any luck. The reason is a bit subtle: prev_page_min_value_.ptr is "lucky" enough to point at memory addresses that make the tests pass (most of the time it points to the same address as cs->min_value_.ptr). I suppose the tests would fail occasionally. I'll try a couple of other tables to see whether they fail when prev_page_min_buffer_.Clear() is missing. For now I specialized the Merge function for StringValue and added DCHECKs to verify that we stored the previous pages' min/max values in their StringBuffers.

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc@29
PS13, Line 29: int
> nit: use static_cast<>
Done

http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql
File testdata/bin/load-dependent-tables.sql:

http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@77
PS13, Line 77: WITH DEFERRED REBUILD IN TABLE hive_index_tbl
> missing trailing semicolon here and on some of the "DROP TABLE" statements
My bad, I thought a complete data load after formatting would execute this file as well, but now I see it doesn't.
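(Going back to the parquet-column-stats discussion above: the boundary-order tracking that _validate_ordering now mirrors can be sketched as follows. This is a simplified Python sketch, not the actual C++ member logic or the real test helper; the name boundary_order is hypothetical.)

```python
# Sketch of boundary-order determination across data pages: two flags
# start out True, and each merged page whose min/max breaks the
# corresponding ordering knocks one of them out. The same check can be
# used on the read/test side to validate a written ColumnIndex.

def boundary_order(pages):
    """pages: list of (min_value, max_value) pairs, one per data page.
    Returns 'ASCENDING', 'DESCENDING', or 'UNORDERED'."""
    ascending = descending = True
    prev_min = prev_max = None
    for page_min, page_max in pages:
        if prev_min is not None:
            # A later page with a smaller min or max breaks ascending order.
            if page_max < prev_max or page_min < prev_min:
                ascending = False
            # A later page with a larger min or max breaks descending order.
            if page_max > prev_max or page_min > prev_min:
                descending = False
        prev_min, prev_max = page_min, page_max
    if ascending:
        return 'ASCENDING'
    if descending:
        return 'DESCENDING'
    return 'UNORDERED'
```

This also shows why stale prev-page buffers are dangerous for string columns: if prev_min/prev_max point at reused memory, the comparisons above silently compute the wrong ordering.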
Now I run it through beeline (the same way as in create-data-load.sh) and my lines execute fine, but I need to comment out the commands above them, because the INSERT OVERWRITE TABLE command in L44 raises an error. Since I didn't touch those lines, I must have invoked the file the wrong way. How do you execute it? Anyway, I fixed the issues based on your comments.

http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@94
PS13, Line 94: '\\\\'
> I had to replace this with \\ to get this to work locally.
Done

http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@106
PS13, Line 106: "}},
> I had to replace }} with } and {{ with { to get this to run locally.
Done

--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 07 May 2018 16:22:47 +0000
Gerrit-HasComments: Yes
