Alex Behm has posted comments on this change. Change subject: IMPALA-5036: Parquet count star optimization ......................................................................
Patch Set 3: (26 comments) http://gerrit.cloudera.org:8080/#/c/6812/3/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 452: } Why not else if as in the previous patch set? Else-if seems more accurate. http://gerrit.cloudera.org:8080/#/c/6812/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 226: 11: optional i64 parquet_count_star_slot_offset i32 right? http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: Line 571: public FunctionCallExpr getMergeAggInputFn() { return mergeAggInputFn_; } > Yes, they are used on line 617, for example. I don't think it would be a go I think you can access members directly in L617 http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 248: * Adds a new slot descriptor to the tuple descriptor of this scan. Also adds an entry * explain what is going to be stored in this new slot descriptor * explain what is returned Line 249: * to 'optimizedAggSmap_' that replaces a count() with a special sum() function that that substitutes count(*) with sum_init_zero(<new-slotref>) Line 915: msg.hdfs_scan_node.setOptimize_parquet_count_star(optimizedAggSmap_ != null); Do we need to pass this to the BE? The presence/absence of the parquet_count_star_slot_offset seems enough to decide. http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: Line 1213: * table scans. instead of scanning the table (fix other places below also) http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: Line 290: public void testParquetStats() { runPlannerTestFile("parquet-stats-agg"); } testParquetStatsAgg() http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test: Line 1: # Verify that that the parquet count(*) optimization is applied in all the cases. spell out "in all the cases" a little more and also mention that in one case we expect the optimization to not be applied Line 22: | | output: sum_init_zero(functional_parquet.alltypes.parquet-stats: num_rows) Can we reduce this to just parquet-stats.num_rows? How do we create such a long label? Line 99: ---- DISTRIBUTEDPLAN Remove here and all tests below. I think showing the distributed plan for the first test case is enough. Line 114: select month, count(*) from functional_parquet.alltypes group by month, year Add a negative test for this one: select count(*), count(year), count(month), Line 172: select max(year), count(*) from functional_parquet.alltypes use avg() instead of max() because max() is going to be optimized in the same way in the future (but avg() definitely not) Line 195: # IMPALA-5036 JIRA number is not very descriptive. Describe what this test case is checking. Line 278: # The count(*) optimization is applied to the inline view even if there is a join. Add a negative test case that shows the query block must have one table ref, e.g. select count(*) from functional_parquet.alltypes a, functional_parquet.alltypes b Line 352: # tinyint_col is not partitioned so the optimization is disabled. tinyint_col is not a partition column Line 402: # Optimization is not applied in the case of count(null). is not applied to count(null) Line 451: # Optimization is not applied because the count(*) is not applied directly to the Optimization is not applied across query blocks, even though it would be correct here. Line 453: select count(*) from ( select int_col from functional_parquet.alltypes) t Add a new test that shows we only consider materialized agg exprs, something like: select year, cnt from ( select year, count(bigint_col), count(*) cnt, avg(int_col) from functional_parquet.alltypes where month=1 group by year ) Line 476: # Optimization is not applied because we are not scanning a Parquet table. Remove. This case is already covered above. http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test: Line 324: WARNING: The following tables are missing relevant table and/or column statistics. Something wrong with your setup? This table should have stats in our dev setup. http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-query/queries/QueryTest/aggregation.test File testdata/workloads/functional-query/queries/QueryTest/aggregation.test: Line 1259: # IMPALA-5036: Tests the correctness of the Parquet count(*) optimization. Let's move these into a new .test file. Also no need to prefix IMPALA-5036 everywhere (that's mostly for regression tests, but this is a new feature) Line 1279: from functional_parquet.alltypes where month > 10 group by year, month want to remove the predicate here (that case is explicitly called out below) Line 1303: # IMPALA-5036: Parquet count(*) optimization with the result of the going into a join. some extra words Line 1316: select 1 from functional_parquet.alltypes having count(*) > 1 Add a count(*) test against an empty table and a table where we filtered all partition columns to make sure that 0 is returned. http://gerrit.cloudera.org:8080/#/c/6812/3/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: Line 275: exec_option = vector.get_value('exec_option') Explain what this test is covering. Also would we get coverage of this automatically if we moved this test into the existing .test file (where your other tests live)? In exhaustive I believe we do run with different batch sizes including 1. Did you confirm that these files have multiple row groups? Might also be good to run a count(*) on these tables: functional_parquet.lineitem_multiblock_one_row_group functional_parquet.lineitem_sixblocks functional_parquet.lineitem_multiblock (You can take a why other tests use them) -- To view, visit http://gerrit.cloudera.org:8080/6812 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-HasComments: Yes