Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 2:

(3 comments)

Thanks for the review, Csaba! I'll work on them.
Replying some that I remember now.

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG@18
PS1, Line 18: 'currentTransaction' column.
> If there is no delete file in an ACID table, then we should be able to stil
I honestly haven't dig deep into this. My purpose is just to avoid optimization 
when row_batches_need_validation_ == true
https://github.com/apache/impala/blob/4aeffe7/be/src/exec/hdfs-orc-scanner.cc#L385-L396

It looks like we need to check that all files of the table is already 
compacted, and I'm not sure if we can check that in front end.
I'm leaning towards filing separate follow-up JIRA since it require us to craft 
compacted vs uncompacted ACID table case.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@786
PS1, Line 786:       // We try to allocate a smaller row batch here because in 
most cases the number
             :       // stripes in a file is much lower than the default row 
batch capacity.
> Why do we add one row per stripe? We could add them together and always add
I tried to do one row per file instead of per stripe, but failed the following 
test

impala-py.test --table_formats "orc/def/block" 
query_test/test_scanners.py::TestTpchScanRangeLengths::test_tpch_scan_ranges

I think that test scenario when one scanner is assigned only parts of the file? 
 Some scanner share a file but work on different stripes?
Therefore, I fallback to how parquet scanner do this, which is one row per row 
group.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@798
PS1, Line 798:         assemble_rows_timer_.Start();
> I think that starting the timer is not too useful here as this logic should
Should I just drop the timer, or move it up to cover 
ResizeAndAllocateTupleBuffer as well?



--
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Thu, 17 Mar 2022 16:54:57 +0000
Gerrit-HasComments: Yes

Reply via email to