Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )
Change subject: IMPALA-11123: Optimize count(star) for ORC scans ...................................................................... Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@23 PS8, Line 23: can you add some benchmark results? I think that the difference should be visible in queries like count(*) from lineitem Parquet scanning may also become faster due to smaller estimates leading to skipping codegen http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc@780 PS3, Line 780: int64_t num_rows = GetNumberOfRowsInFile(); > Done Most of the logic still seems to be duplicated - this function seems to be nearly the same as the corresponding logic in Parquet. http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc@807 PS8, Line 807: // There are no materialized slots, e.g. "select 1" over the table. We can serve : // this query from just the file metadata. We don't need to read the column data. : // Only scanner of the footer split will run in this case. See the logic in : // HdfsScanner::IssueFooterRanges() and HdfsScanNodeBase::ReadsFileMetadataOnly(). : // We might also get here for count(*) query against full acid table such as: : // "select count(*) from functional_orc_def.alltypes;" : DCHECK(is_footer_scanner_); : int64_t file_rows = GetNumberOfRowsInFile(); : if (stripe_rows_read_ == file_rows) { : eos_ = true; : return Status::OK(); : } : assemble_rows_timer_.Start(); : DCHECK_LT(stripe_rows_read_, file_rows); : int64_t rows_remaining = file_rows - stripe_rows_read_; : int max_tuples = min<int64_t>(row_batch->capacity(), rows_remaining); : TupleRow* current_row = row_batch->GetRow(row_batch->AddRow()); : int num_to_commit = WriteTemplateTuples(current_row, max_tuples); : Status status = CommitRows(num_to_commit, row_batch); : assemble_rows_timer_.Stop(); : RETURN_IF_ERROR(status); : stripe_rows_read_ += max_tuples; : COUNTER_ADD(scan_node_->rows_read_counter(), num_to_commit); : return Status::OK(); : } It would be also nice to unify this with the Parquet version - my understanding is that the only difference is using stripe_rows_read_ vs row_group_rows_read_. A shared member could be used like row_group_or_stripe_rows_read_ in HdfsColumnarScanner http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test: http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@82 PS3, Line 82: cardinality=24 > Modified the cardinality count in HdfsScanNode.java. Can you mention this as an "addittional change" in the commit message? Currently it suggests that only ORC behavior was changed. http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test: http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@9 PS8, Line 9: bigint Don't we have a similar counter as for Parquet in the profile? ---- RUNTIME_PROFILE aggregation(SUM, NumRowGroups): 0 -- To view, visit http://gerrit.cloudera.org:8080/18327 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091 Gerrit-Change-Number: 18327 Gerrit-PatchSet: 8 Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Comment-Date: Tue, 29 Mar 2022 12:15:14 +0000 Gerrit-HasComments: Yes