Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@772
PS1, Line 772: Status HdfsOrcScanner::GetNextInternal(RowBatch* row_batch) {
This method is more than 100 lines now. Could you extract the count-star code 
into a separated method?


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@786
PS1, Line 786:       // We try to allocate a smaller row batch here because in 
most cases the number
             :       // stripes in a file is much lower than the default row 
batch capacity.
> I tried to do one row per file instead of per stripe, but failed the follow
Each scanner thread processes a split which could be part of the file. Only 
stripes whose mid point locates inside the split will be processed by the 
scanner (picked in NextStripe()):
https://github.com/apache/impala/blob/1739edf2d97009062cb339a3c276f01dbd4d33bd/be/src/exec/hdfs-orc-scanner.cc#L903-L910

I think the missing part of returning one row per file is here:
https://github.com/apache/impala/blob/1739edf2d97009062cb339a3c276f01dbd4d33bd/be/src/exec/hdfs-scanner.cc#L841-L846
The same as the IsZeroSlotTableScan() case, we just need the scanner of the 
footer split.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@818
PS1, Line 818:       uint64_t file_rows = reader_->getNumberOfRows();
             :       if (stripe_rows_read_ == file_rows) {
> We didn't change this logic, but it seems a bit suspicious to me: will thes
This is a special case that only the scanner of the footer split will process 
the whole file:
https://github.com/apache/impala/blob/1739edf2d97009062cb339a3c276f01dbd4d33bd/be/src/exec/hdfs-scanner.cc#L841-L846


http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@4
PS2, Line 4: functional_orc_def.uncomp_src_alltypes
This table is a managed table. I'm not sure why its table properties don't have 
the transaction settings. But I think we need another table that is 
non-transactional, e.g. in tpch_orc_def.


http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@279
PS2, Line 279:     if (vector.get_value('table_format').file_format != 'text' or
             :           vector.get_value('table_format').compression_codec != 
'none'):
Shouldn't we only run this on ORC? It seems the other tests for parquet and 
kudu also run on other formats, which I think is redudant.


http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@283
PS2, Line 283: unique_database
It seems we don't need 'unique_database' since all tables we used are in 
functional_orc_def.



--
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Sat, 19 Mar 2022 12:53:21 +0000
Gerrit-HasComments: Yes

Reply via email to