Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... Patch Set 8: (23 comments) Looks great. I was able to look at 11 files and plan to finish the rest 9 files tomorrow. Most comments in this set are minors. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h File be/src/exec/hdfs-columnar-scanner.h: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h@59 PS8, Line 59: rows nit. use of 'tuples' instead of 'row' here should be better as tuples are mentioned earlier in the comment. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc File be/src/exec/hdfs-columnar-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@66 PS8, Line 66: must not be called when nit. Is it possible to DCHECK() it? http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@562 PS8, Line 562: has nit. have. May also include a sentence at the end to illustrate: For example, if the threshold is 10, then ... ... http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@935 PS8, Line 935: /// are the readers reading columns involved in either static filter or runtime filter. nit. Will be useful to describe non_filter_readers: All 'non_filter_readers' are responsible for reading surviving rows from those columns that are not involved in filtering. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117 PS8, Line 117: complete_micro_batch_.start = 0; : complete_micro_batch_.end = scratch_batch_->capacity - 1; : complete_micro_batch_.length = scratch_batch_->capacity; This probably should be moved to the cst. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@262 PS8, Line 262: for (int column_idx = 0; column_idx < column_readers.size(); column_idx++) { : ParquetColumnReader* column_reader = column_readers[column_idx]; : auto slot_desc = column_reader->slot_desc(); this loop can be written as for (auto column_reader : column_readers) { ... http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@268 PS8, Line 268: filter_readers Should we do the following prior to the for loop or DCHECK() they are empty? filter_readers->erase() non_filter_reders->erase() http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2215 PS8, Line 2215: Instead nit. Suggest to remove as it can cause confusion on the action taken in this step. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2218 PS8, Line 2218: in earlier iteratio remove? http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2236 PS8, Line 2236: disabled nit: "not needed" is better. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253 PS8, Line 2253: Status fill_status = FillScratchMicroBatches(filter_readers_, row_batch, : skip_row_group, &complete_micro_batch_, 1, scratch_batch_->capacity, : scratch_batch_->num_tuples); : if (!fill_status.ok() || *skip_row_group) { : return fill_status; : } nit Suggest to use the following pattern to enhance code readability. RETURN_IF_ERROR(foo( skip_row_group)); if (*skip_row_group) return Status::OK(); http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2282 PS8, Line 2282: num_tuples Suggest to use int* = nullptr instead of int& for the function, so that we can drop int num_tuples at line 2274. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2291 PS8, Line 2291: if (!fill_status.ok() || *skip_row_group) { : return fill_status; nit. same comment on code readability. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2334 PS8, Line 2334: for (int c = 0; c < column_readers.size(); ++c) { : ParquetColumnReader* col_reader = column_readers[c]; nit. Use for (auto col_reader : column_readers) { http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336 PS8, Line 2336: due nit or due to http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336 PS8, Line 2336: in nit to http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2358 PS8, Line 2358: for (int c = 0; c < column_readers.size(); ++c) { : ParquetColumnReader* col_reader = column_readers[c] nit. Same comment above. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2363 PS8, Line 2363: if (r == 0) UNLIKELY http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2364 PS8, Line 2364: if (micro_batches[0].start > 0) { I wonder why we need this test. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@168 PS8, Line 168: /// row after 'skip_row_id'. nit. add "and ignores 'num_rows'". http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1151 PS8, Line 1151: Status::OK(); Should we return error instead OK here? http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1156 PS8, Line 1156: // Read next data page's header only. 'false' passed in call below signifies that. nit. Can put the comment next to the bool argument for better readability. RETURN_IF_ERROR(col_chunk_reader_.ReadNextDataPage(&eos, &data_, &data_size, false /* read next page header only */)); http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1189 PS8, Line 1189: // Data can be empty if the column contains all NULLs nit. I wonder if this comment is correct. For example, if the filter is NOT NULL, then this page should be looked at and survive. -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 8 Gerrit-Owner: Amogh Margoor <[email protected]> Gerrit-Reviewer: Amogh Margoor <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 21 Oct 2021 17:26:51 +0000 Gerrit-HasComments: Yes
