Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out 
rows in Parquet table.
......................................................................


Patch Set 8:

(23 comments)

Looks great. I was able to look at 11 files and plan to finish the rest 9 files 
tomorrow.

Most comments in this set are minors.

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h@59
PS8, Line 59: rows
nit. use of 'tuples' instead of 'row' here should be better as tuples are 
mentioned earlier in the comment.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@66
PS8, Line 66: must not be called when
nit. Is it possible to DCHECK() it?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@562
PS8, Line 562: has
nit. have.

May also include a sentence at the end to illustrate:

For example, if the threshold is 10, then ... ...


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@935
PS8, Line 935:   /// are the readers reading columns involved in either static 
filter or runtime filter.
nit. Will be useful to describe non_filter_readers: All 'non_filter_readers' 
are responsible for reading surviving rows from those columns that are not 
involved in filtering.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117
PS8, Line 117:  complete_micro_batch_.start = 0;
             :   complete_micro_batch_.end = scratch_batch_->capacity - 1;
             :   complete_micro_batch_.length = scratch_batch_->capacity;
This probably should be moved to the cst.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@262
PS8, Line 262: for (int column_idx = 0; column_idx < column_readers.size(); 
column_idx++) {
             :     ParquetColumnReader* column_reader = 
column_readers[column_idx];
             :     auto slot_desc = column_reader->slot_desc();
this loop can be written as
for (auto column_reader : column_readers) {
...


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@268
PS8, Line 268: filter_readers
Should we do the following prior to the for loop or DCHECK() they are empty?

filter_readers->erase()
non_filter_reders->erase()


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2215
PS8, Line 2215: Instead
nit. Suggest to remove as it can cause confusion on the action taken in this 
step.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2218
PS8, Line 2218: in earlier iteratio
remove?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2236
PS8, Line 2236: disabled
nit: "not needed" is better.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253
PS8, Line 2253: Status fill_status = FillScratchMicroBatches(filter_readers_, 
row_batch,
              :         skip_row_group, &complete_micro_batch_, 1, 
scratch_batch_->capacity,
              :         scratch_batch_->num_tuples);
              :     if (!fill_status.ok() || *skip_row_group) {
              :       return fill_status;
              :     }
nit Suggest to use the following pattern to enhance code readability.

RETURN_IF_ERROR(foo( skip_row_group));
if (*skip_row_group) return Status::OK();


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2282
PS8, Line 2282: num_tuples
Suggest to use int* = nullptr instead of int& for the function, so that we can 
drop int num_tuples at line 2274.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2291
PS8, Line 2291: if (!fill_status.ok() || *skip_row_group) {
              :         return fill_status;
nit. same comment on code readability.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2334
PS8, Line 2334:  for (int c = 0; c < column_readers.size(); ++c) {
              :       ParquetColumnReader* col_reader = column_readers[c];
nit. Use for (auto col_reader : column_readers) {


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336
PS8, Line 2336: due
nit or due to


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336
PS8, Line 2336: in
nit to


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2358
PS8, Line 2358:   for (int c = 0; c < column_readers.size(); ++c) {
              :     ParquetColumnReader* col_reader = column_readers[c]
nit. Same comment above.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2363
PS8, Line 2363: if (r == 0)
UNLIKELY


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2364
PS8, Line 2364:  if (micro_batches[0].start > 0) {
I wonder why we need this test.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@168
PS8, Line 168:   /// row after 'skip_row_id'.
nit. add "and ignores 'num_rows'".


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1151
PS8, Line 1151: Status::OK();
Should we return error instead OK here?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1156
PS8, Line 1156: // Read next data page's header only. 'false' passed in call 
below signifies that.
nit. Can put the comment next to the bool argument for better readability.

 RETURN_IF_ERROR(col_chunk_reader_.ReadNextDataPage(&eos, &data_, &data_size, 
false /* read next page header only */));


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1189
PS8, Line 1189: // Data can be empty if the column contains all NULLs
nit. I wonder if this comment is correct. For example, if the filter is NOT 
NULL, then this page should be looked at and survive.



--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 21 Oct 2021 17:26:51 +0000
Gerrit-HasComments: Yes

Reply via email to