Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... Patch Set 3: (9 comments) Was able to complete the review. Have some comments. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h File be/src/exec/hdfs-columnar-scanner.h: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h@60 PS3, Line 60: bool* selected_rows Need to document the use of this new argument in the comment above. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@249 PS3, Line 249: vector<ParquetColumnReader*>& nit. const vector<ParquetColumnReader*>&. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250 PS3, Line 250: vector<ParquetColumnReader*>& filter_readers, : vector<ParquetColumnReader*>& non_filter_readers) In Impala BE, prefer to use vector<ParquetColumnReader*>* when the argument is to be modified. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253 PS3, Line 253: conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->conjuncts()), : std::end(scan_node_->conjuncts())); : conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->filter_exprs()), Can you please check if the code is sufficient to deal with min/max filters from HJs? For min/max filter F from HJ, we allocate two slots in minMaxTuple_ (see HdfsScanNode.java line 843 to 846, and save the start slot index, together with the filter Id, in overlapPredicateDescs_. Thus, we may need to iterate over BE version of overlapPredicateDescs_ (see GetOverlapPredicateDescs() in parquet/hdfs-parquet-scanner.h) to grab the pair of the slots from which the column path can be found. Is it possible to later materialize rows surviving a min/max filter? http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@270 PS3, Line 270: vector<ScalarExpr*> conjuncts should use const T&. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330 PS3, Line 2330: int skip_length) By reading the code, my guess is that each batch covers a number of, up to skip length, selected rows. When the selected rows interleave with non-selected rows (for example, T, T, F, T, F, F, T), then use of selected_rows as the representation should be better, since here in the code, the batch covers both rows T and F. When the selected rows form a cluster (such as on a sorted column), then the current implementation selects on span of rows which is very nice. Is it possible to work with selected_rows directly? http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2377 PS3, Line 2377: continue_execution should init this boolean, as the code below conditionally set it. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381 PS3, Line 2381: if (micro_batches[0].start > 0) { : if (UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) { : return Status(Substitute("Couldn't skip rows in file $0.", filename())); : } : } Do we need an else branch to deal with micro_batches[0].start==0? If it is not possible, a DCHECK() on it probably should be added. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417 PS3, Line 2417: return Status::OK(); I wonder if a non Status::OK() object should return here. -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 3 Gerrit-Owner: Amogh Margoor <[email protected]> Gerrit-Reviewer: Amogh Margoor <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 04 Oct 2021 19:38:15 +0000 Gerrit-HasComments: Yes
