Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@69
PS12, Line 69: 2, 4, 8, 16, 32
> I see. Let us assume the following:
Ah, got it! It may not be sufficient though. For instance:

0 1 2 3 4 5 6 7 8 9 0 1 2 3
T T F T F F F T T T T T F F

-> With a gap of 5, we would accept the 2 batches [1, 3] and [10, 11] as a correct result even though they are not. Some extra conditions are probably needed.

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@436
PS12, Line 436: row_regex:.* RF00.\[min_max\] -. .\.wr_item_sk.*
> In addition, I wonder if we can grab a few counters on late materialized ro
I had commented on the issue with counters earlier (pasting it below). Let me know your thoughts:

--- PASTED ---
Thanks Qifan for the review. The suggestion of a counter is good and something I pondered earlier. The issue is that we don't skip decoding rows; instead we skip decoding values, where one row may consist of hundreds of values, some of which will be read while others are skipped. We cannot accurately track the number of values being skipped in the current scheme of things without incurring a significant performance penalty. For instance, we sometimes skip pages without decompressing them. If a skipped page has a page index with candidate rows, we would need to decompress the page to get an accurate count of the values skipped due to late materialisation.
In the scenario where we skip pages directly, even if the page is not compressed, figuring out the number of values for the corresponding candidate range can be time-consuming. Hence the timed counters, which are already present, are more appropriate here.

--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor
Gerrit-Reviewer: Amogh Margoor
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Kurt Deschler
Gerrit-Reviewer: Qifan Chen
Gerrit-Reviewer: Zoltan Borok-Nagy
Gerrit-Comment-Date: Tue, 26 Oct 2021 18:02:14 +0000
Gerrit-HasComments: Yes
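
On the Line 69 comment for scratch-tuple-batch-test.cc, here is a rough sketch of the kind of extra condition that might be needed. It is not the code in the patch. `Batch`, `EndpointsAndGapsOk`, `CoversAllSelected` and `IsValidBatching` are hypothetical names, and the sketch assumes the check under discussion only validates batch endpoints and the distance between consecutive batches. Under that assumption, the batches [1, 3] and [10, 11] pass for the bitmap above, and a coverage check over the selected rows is what rejects them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical inclusive row range; not Impala's ScratchMicroBatch.
struct Batch {
  int start;
  int end;
};

// Assumed weaker check: every batch is in range, starts and ends on a
// selected row, and is at least 'gap' rows away from the previous batch.
bool EndpointsAndGapsOk(const std::vector<bool>& selected,
    const std::vector<Batch>& batches, int gap) {
  const int n = static_cast<int>(selected.size());
  for (std::size_t i = 0; i < batches.size(); ++i) {
    const Batch& b = batches[i];
    if (b.start < 0 || b.end >= n || b.start > b.end) return false;
    if (!selected[b.start] || !selected[b.end]) return false;
    if (i > 0 && b.start - batches[i - 1].end < gap) return false;
  }
  return true;
}

// Extra condition: every selected row must fall inside some batch.
// Precondition: all batches are within [0, selected.size()).
bool CoversAllSelected(const std::vector<bool>& selected,
    const std::vector<Batch>& batches) {
  const int n = static_cast<int>(selected.size());
  std::vector<bool> covered(selected.size(), false);
  for (const Batch& b : batches) {
    assert(b.start >= 0 && b.end < n);
    for (int r = b.start; r <= b.end; ++r) covered[r] = true;
  }
  for (std::size_t r = 0; r < selected.size(); ++r) {
    if (selected[r] && !covered[r]) return false;
  }
  return true;
}

// Combined check; the range validation runs first, so the coverage
// check never sees an out-of-range batch.
bool IsValidBatching(const std::vector<bool>& selected,
    const std::vector<Batch>& batches, int gap) {
  return EndpointsAndGapsOk(selected, batches, gap)
      && CoversAllSelected(selected, batches);
}
```

With the bitmap from the comment and a gap of 5, `EndpointsAndGapsOk` accepts {[1, 3], [10, 11]}, while `IsValidBatching` rejects it because rows 0, 7, 8 and 9 are selected but not covered. A single batch [0, 11] satisfies both checks.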
