Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )
Change subject: IMPALA-11123: Optimize count(star) for ORC scans ...................................................................... Patch Set 2: (3 comments) Thanks for the review, Csaba! I'll work on them. Replying some that I remember now. http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG@18 PS1, Line 18: 'currentTransaction' column. > If there is no delete file in an ACID table, then we should be able to stil I honestly haven't dig deep into this. My purpose is just to avoid optimization when row_batches_need_validation_ == true https://github.com/apache/impala/blob/4aeffe7/be/src/exec/hdfs-orc-scanner.cc#L385-L396 It looks like we need to check that all files of the table is already compacted, and I'm not sure if we can check that in front end. I'm leaning towards filing separate follow-up JIRA since it require us to craft compacted vs uncompacted ACID table case. http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@786 PS1, Line 786: // We try to allocate a smaller row batch here because in most cases the number : // stripes in a file is much lower than the default row batch capacity. > Why do we add one row per stripe? We could add them together and always add I tried to do one row per file instead of per stripe, but failed the following test impala-py.test --table_formats "orc/def/block" query_test/test_scanners.py::TestTpchScanRangeLengths::test_tpch_scan_ranges I think that test scenario when one scanner is assigned only parts of the file? Some scanner share a file but work on different stripes? Therefore, I fallback to how parquet scanner do this, which is one row per row group. http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@798 PS1, Line 798: assemble_rows_timer_.Start(); > I think that starting the timer is not too useful here as this logic should Should I just drop the timer, or move it up to cover ResizeAndAllocateTupleBuffer as well? -- To view, visit http://gerrit.cloudera.org:8080/18327 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091 Gerrit-Change-Number: 18327 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Comment-Date: Thu, 17 Mar 2022 16:54:57 +0000 Gerrit-HasComments: Yes
