Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... Patch Set 9: (24 comments) http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h File be/src/exec/hdfs-columnar-scanner.h: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h@59 PS8, Line 59: rows > nit. use of 'tuples' instead of 'row' here should be better as tuples are m Done. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc File be/src/exec/hdfs-columnar-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@66 PS8, Line 66: > nit. Is it possible to DCHECK() it? It is being checked in FilterScratchBatch. Moved comment to the same function too. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@110 PS8, Line 110: int HdfsColumnarScanner::ProcessScratchBatchCodegenOrInterpret(RowBatch* dst_batch) { > nit: unintended new line? Right. This function was changed in earlier patch with an extra argument. Hence formatting went for toss. Changed it now. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@562 PS8, Line 562: es > nit. have. Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@935 PS8, Line 935: /// Tuple memory to write to is specified by 'scratch_batch->tuple_mem'. > nit. Will be useful to describe non_filter_readers: All 'non_filter_readers Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117 PS8, Line 117: complete_micro_batch_.start = 0; : complete_micro_batch_.end = scratch_batch_->capacity - 1; : complete_micro_batch_.length = scratch_batch_->capacity; > This probably should be moved to the cst. what is cst ? http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@262 PS8, Line 262: if (DoesNotSupportLateMaterialization(column_reader) || (slot_desc != nullptr : && std::find(conjunct_slot_ids_.begin(), conjunct_slot_ids_.end(), slot_desc->id()) : != conjunct_slot_ids_.end())) { > this loop can be written as Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@268 PS8, Line 268: > Should we do the following prior to the for loop or DCHECK() they are empty Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2215 PS8, Line 2215: These c > nit. Suggest to remove as it can cause confusion on the action taken in thi Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2218 PS8, Line 2218: materializing tupl > remove? Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2236 PS8, Line 2236: ptr); > nit: "not needed" is better. it is both actually. have updated comment. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253 PS8, Line 2253: // 1. Filter rows only materializing the columns in 'filter_readers_' : // 2. Transfer the surviving rows : // 3. Materialize rest of the columns only for surviving rows. : : RETURN_IF_ERROR(FillScratchMicroBatches(filter_readers_, row_batch, : > nit Suggest to use the following pattern to enhance code readability. Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2282 PS8, Line 2282: omplete_mi > Suggest to use int* = nullptr instead of int& for the function, so that we It cannot be nullptr and will be modified by existing APIs like ReadValueBatch. but it has been changed to pointer though. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2291 PS8, Line 2291: if (*skip_row_group) { return Status::OK(); } : } > nit. same comment on code readability. Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336 PS8, Line 2336: >S > nit to Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2358 PS8, Line 2358: bool continue_execution = false; : int last = -1; > nit. Same comment above. We need the index here unlike the above place. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2363 PS8, Line 2363: if (UNL > UNLIKELY In many cases it would be micro batch of size 1. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2364 PS8, Line 2364: return Status(Substitute("Cou > I wonder why we need this test. First batch might start from 100th row, in which case we need to skip first 100 rows. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@168 PS8, Line 168: /// row after 'skip_row_id' and ignores 'num_rows'. > nit. add "and ignores 'num_rows'". Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@509 PS8, Line 509: re not st > Please add a comment about 'remaining' Done. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1151 PS8, Line 1151: Status::OK(); > Should we return error instead OK here? Nope. As commented earlier if error needs to be returned LogCorruptNumValuesInMetadataError will do it. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1156 PS8, Line 1156: RETURN_IF_ERROR(col_chunk_reader_.ReadNextDataPage(&eos, &data_, &data_size, > nit. Can put the comment next to the bool argument for better readability. Done http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1282 PS8, Line 1282: * > Output parameters should be pointers. Done. changed it at other places too. http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h File be/src/exec/scratch-tuple-batch.h: http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@177 PS8, Line 177: int skip_length, ScratchMi > nit: usually we put output parameters at the end of the param list Done -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 9 Gerrit-Owner: Amogh Margoor <[email protected]> Gerrit-Reviewer: Amogh Margoor <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 21 Oct 2021 19:18:16 +0000 Gerrit-HasComments: Yes
