Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13993 )
Change subject: IMPALA-8834: Short-circuit partition key scan ...................................................................... Patch Set 16: (6 comments) http://gerrit.cloudera.org:8080/#/c/13993/16//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13993/16//COMMIT_MSG@17 PS16, Line 17: THe > nit: The Done http://gerrit.cloudera.org:8080/#/c/13993/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/13993/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1135 PS16, Line 1135: 100 > nit: should this magic number be replaced with analyzer.getQueryOptions().e This is a different value - it's an estimate of the relative cost of opening a file vs reading a single row. I factored it out into a separate constant. I didn't necessarily want to add another tuning parameter. http://gerrit.cloudera.org:8080/#/c/13993/16/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test File testdata/workloads/functional-planner/queries/PlannerTest/distinct.test: http://gerrit.cloudera.org:8080/#/c/13993/16/testdata/workloads/functional-planner/queries/PlannerTest/distinct.test@526 PS16, Line 526: partition key scan : row-size=0B cardinality=24 > hmm... Did you run the test with some extract options? This is what I get locally on my branch. [localhost.EXAMPLE.COM:21000] default> set num_nodes=1; explain select 1 from (select count(distinct 1) x from functional.alltypes) t where t.x is not null; NUM_NODES set to 1 Query: explain select 1 from (select count(distinct 1) x from functional.alltypes) t where t.x is not null +------------------------------------------------------------+ | Explain String | +------------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=1.97MB Threads=2 | | Per-Host Resource Estimates: Memory=138MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 02:AGGREGATE [FINALIZE] | | | output: count(1) | | | having: count(1) IS NOT NULL | | | row-size=8B cardinality=1 | | | | | 01:AGGREGATE | | | group by: 1 | | | row-size=1B cardinality=1 | | | | | 00:SCAN HDFS [functional.alltypes] | | HDFS partitions=24/24 files=24 size=478.45KB | | partition key scan | | row-size=0B cardinality=24 | +------------------------------------------------------------+ http://gerrit.cloudera.org:8080/#/c/13993/16/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test File testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test: http://gerrit.cloudera.org:8080/#/c/13993/16/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@13 PS16, Line 13: aggregation(SUM, RowsReturned): 48 > Are we checking 48 == 2 * NumFiles? If so, how does the factor (2) come? Sh It's actually the total across all the operators in the plans. I actually realised that this assumes a 3-node minicluster, so I pulled the RowsReturned check into a separate test that only runs in the HDFS minicluster configuration. +---------------------+--------+-------+----------+----------+-------+------------+-----------+---------------+---------------------+ | Operator | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +---------------------+--------+-------+----------+----------+-------+------------+-----------+---------------+---------------------+ | F02:ROOT | 1 | 1 | 31.57us | 31.57us | | | 0 B | 0 B | | | 04:EXCHANGE | 1 | 1 | 44.26us | 44.26us | 2 | 2 | 24.00 KB | 16.00 KB | UNPARTITIONED | | F01:EXCHANGE SENDER | 3 | 3 | 88.39us | 132.93us | | | 25.59 KB | 0 B | | | 03:AGGREGATE | 3 | 3 | 1.66ms | 1.89ms | 2 | 2 | 1.95 MB | 10.00 MB | FINALIZE | | 02:EXCHANGE | 3 | 3 | 26.16us | 48.34us | 6 | 2 | 48.00 KB | 16.00 KB | HASH(`year`) | | F00:EXCHANGE SENDER | 3 | 3 | 158.73us | 169.28us | | | 80.78 KB | 0 B | | | 01:AGGREGATE | 3 | 3 | 1.19ms | 1.25ms | 6 | 2 | 2.02 MB | 10.00 MB | STREAMING | | 00:SCAN HDFS | 3 | 3 | 2.75ms | 2.92ms | 24 | 24 | 364.00 KB | 128.00 MB | functional.alltypes | +---------------------+--------+-------+----------+----------+-------+------------+-----------+---------------+---------------------+ http://gerrit.cloudera.org:8080/#/c/13993/16/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@30 PS16, Line 30: order by year, month > nit: I think we don't need the ORDER BY clause. The test framework will sor Done http://gerrit.cloudera.org:8080/#/c/13993/16/tests/query_test/test_queries.py File tests/query_test/test_queries.py: http://gerrit.cloudera.org:8080/#/c/13993/16/tests/query_test/test_queries.py@268 PS16, Line 268: if vector.get_value('exec_option')['mt_dop'] != 0: I can update this after rebasing -- To view, visit http://gerrit.cloudera.org:8080/13993 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I26c87525a4f75ffeb654267b89948653b2e1ff8c Gerrit-Change-Number: 13993 Gerrit-PatchSet: 16 Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Tue, 19 May 2020 17:17:35 +0000 Gerrit-HasComments: Yes
