Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 13:

(15 comments)

I spent a bit of time trying to break the code by injecting bugs and it looks 
like the test coverage is pretty good - there were a couple of things it didn't 
catch though that might be worth investigating a little.

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195
PS2, Line 1195:   // Populate RowGroup::sorting_columns with all columns 
specified by the Frontend.
> AFAIK we always write only one row group. I just copy-pasted this part from
I didn't follow up on this. It looks like the code only ever writes one row 
group but it's somewhat generalized to handle multiple row groups. I think 
people might have used this in the past to generate custom files. If we ever 
wrote out multiple row groups, we'd have to test it properly.

In the meantime I think we could remove this if and just use current_row_group_ 
below instead of hardcoding [0] and it would all work ok.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@107
PS13, Line 107:       page_index_mem_tracker_(-1, "Parquet Page Index",
We can have many HdfsTableWriters per HdfsTableSink if we're doing non-cluster 
inserts.

As a rule of thumb, I think we should have O(1) MemTrackers per ExecNode or 
DataSink - O(n) or greater could cause problems. Otherwise the MemTracker dumps 
could get huge and cause problems if we want to serialize them into a string.

We could have one "Parquet Page Indices" MemTracker per HdfsTableSink or we 
could just track directly against the HdfsTableSink MemTracker.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1212
PS13, Line 1212:   if (file_metadata_.row_groups.size() != 1) return 
Status::OK();
This


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1214
PS13, Line 1214:   parquet::RowGroup* row_group = 
&(file_metadata_.row_groups[0]);
Can't we use current_row_group_ here?


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1216
PS13, Line 1216:   for (int i = 0; i < columns_.size(); i++) {
I injected an error here by forgetting to write out one of the pages and the 
tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1241
PS13, Line 1241:     row_group->columns[i].__set_offset_index_offset(file_pos_);
I added some errors to the offsets and lengths - looks like the tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@31
PS13, Line 31:   ascending_boundary_order_ = true;
I tried injecting a bug by initializing these to "false". The tests didn't 
catch it. Would be good to add the coverage but this seems less bad of a 
coverage gap than the opposite (claiming an order that is not valid).


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@56
PS13, Line 56: false
I tried injecting a bug by flipping the value of this and the tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@60
PS13, Line 60:         if (MinMaxTrait<T>::Compare(prev_page_max_value_, 
cs->max_value_) < 0 ||
I tried injecting a bug by removing the first condition and the tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@68
PS13, Line 68: max_value_
I tried switching this to cs->min_value. The tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@69
PS13, Line 69:     prev_page_min_buffer_.Clear();
I tried removing this line. The tests didn't catch it, even under ASAN.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc@29
PS13, Line 29: int
nit: use static_cast<>


http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql
File testdata/bin/load-dependent-tables.sql:

http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@77
PS13, Line 77: WITH DEFERRED REBUILD IN TABLE hive_index_tbl
missing trailing semicolon here and on some of the "DROP TABLE" statements 
below -I tried to run this file locally and it failed.


http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@94
PS13, Line 94: '\\\\'
I had to replace this with \\ to get this to work locally.


http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@106
PS13, Line 106: "}},
I had to replace }} with } and {{ with { to get this to run locally.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 25 Apr 2018 23:26:56 +0000
Gerrit-HasComments: Yes

Reply via email to