Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@23
PS8, Line 23:
can you add some benchmark results? I think that the difference should be 
visible in queries like count(*) from lineitem

Parquet scanning may also become faster due to smaller estimates leading to 
skipping codegen


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc@780
PS3, Line 780:   int64_t num_rows = GetNumberOfRowsInFile();
> Done
Most of the logic still seems to be duplicated - this function seems to be 
nearly the same as the corresponding logic in Parquet.


http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc@807
PS8, Line 807:     // There are no materialized slots, e.g. "select 1" over the 
table.  We can serve
             :     // this query from just the file metadata.  We don't need to 
read the column data.
             :     // Only scanner of the footer split will run in this case. 
See the logic in
             :     // HdfsScanner::IssueFooterRanges() and 
HdfsScanNodeBase::ReadsFileMetadataOnly().
             :     // We might also get here for count(*) query against full 
acid table such as:
             :     // "select count(*) from functional_orc_def.alltypes;"
             :     DCHECK(is_footer_scanner_);
             :     int64_t file_rows = GetNumberOfRowsInFile();
             :     if (stripe_rows_read_ == file_rows) {
             :       eos_ = true;
             :       return Status::OK();
             :     }
             :     assemble_rows_timer_.Start();
             :     DCHECK_LT(stripe_rows_read_, file_rows);
             :     int64_t rows_remaining = file_rows - stripe_rows_read_;
             :     int max_tuples = min<int64_t>(row_batch->capacity(), 
rows_remaining);
             :     TupleRow* current_row = 
row_batch->GetRow(row_batch->AddRow());
             :     int num_to_commit = WriteTemplateTuples(current_row, 
max_tuples);
             :     Status status = CommitRows(num_to_commit, row_batch);
             :     assemble_rows_timer_.Stop();
             :     RETURN_IF_ERROR(status);
             :     stripe_rows_read_ += max_tuples;
             :     COUNTER_ADD(scan_node_->rows_read_counter(), num_to_commit);
             :     return Status::OK();
             :   }
It would be also nice to unify this with the Parquet version - my understanding 
is that the only difference is using stripe_rows_read_ vs row_group_rows_read_. 
A shared member could be used like row_group_or_stripe_rows_read_ in 
HdfsColumnarScanner


http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@82
PS3, Line 82: cardinality=24
> Modified the cardinality count in HdfsScanNode.java.
Can you mention this as an "addittional change" in the commit message? 
Currently it suggests that only ORC behavior was changed.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@9
PS8, Line 9: bigint
Don't we have a similar counter as for Parquet in the profile?
---- RUNTIME_PROFILE
aggregation(SUM, NumRowGroups): 0



--
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 12:15:14 +0000
Gerrit-HasComments: Yes

Reply via email to