Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-5036: Parquet count star optimization ......................................................................
Patch Set 3: (28 comments) http://gerrit.cloudera.org:8080/#/c/6812/3/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 445: *dst_slot = file_metadata_.row_groups[row_group_idx_].num_rows; > Bounds check against file_metadata_.num_rows (i.e. keep a running counter a Done Line 452: } > Why not else if as in the previous patch set? Else-if seems more accurate. Reverted to else if. (I don't think it matters if we have else if or not, the behavior is identical in both cases) Line 454: if (scan_node_->IsZeroSlotTableScan()) { > Why is this optimization not redundant now? Maybe update the comment to in Done http://gerrit.cloudera.org:8080/#/c/6812/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 226: 11: optional i64 parquet_count_star_slot_offset > Would it be simpler to have this be one parameter and indicate truth by pas Yes, I did something similar. (its now true is if this parameter is set). Line 226: 11: optional i64 parquet_count_star_slot_offset > i32 right? Ah yes, because it's int instead of long in Java. Done http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 248: * Adds a new slot descriptor to the tuple descriptor of this scan. Also adds an entry > * explain what is going to be stored in this new slot descriptor Done Line 249: * to 'optimizedAggSmap_' that replaces a count() with a special sum() function that > that substitutes count(*) with sum_init_zero(<new-slotref>) Done Line 915: msg.hdfs_scan_node.setOptimize_parquet_count_star(optimizedAggSmap_ != null); > Do we need to pass this to the BE? The presence/absence of the parquet_coun Done http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: Line 1213: * table scans. > instead of scanning the table (fix other places below also) Done http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: Line 290: public void testParquetStats() { runPlannerTestFile("parquet-stats-agg"); } > testParquetStatsAgg() Done http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test: Line 1: # Verify that that the parquet count(*) optimization is applied in all the cases. > spell out "in all the cases" a little more and also mention that in one cas Done Line 22: | | output: sum_init_zero(functional_parquet.alltypes.parquet-stats: num_rows) > Can we reduce this to just parquet-stats.num_rows? How do we create such a The slot descriptor label gets printed here that is set on line 263 in HdfsScanNode.java. The full path is printed by default. Are you suggesting to add some kind of extra plumbing how labels get printed? Line 99: ---- DISTRIBUTEDPLAN > Remove here and all tests below. I think showing the distributed plan for t Done Line 114: select month, count(*) from functional_parquet.alltypes group by month, year > Add a negative test for this one: Added a select count(year) from alltypes. Line 172: select max(year), count(*) from functional_parquet.alltypes > use avg() instead of max() because max() is going to be optimized in the sa Done Line 195: # IMPALA-5036 > JIRA number is not very descriptive. Describe what this test case is checki Rewrote. Still feels like the description is not quite right. Line 278: # The count(*) optimization is applied to the inline view even if there is a join. > Add a negative test case that shows the query block must have one table ref Done Line 352: # tinyint_col is not partitioned so the optimization is disabled. > tinyint_col is not a partition column Done Line 402: # Optimization is not applied in the case of count(null). > is not applied to count(null) Done Line 451: # Optimization is not applied because the count(*) is not applied directly to the > Optimization is not applied across query blocks, even though it would be co Done Line 453: select count(*) from ( select int_col from functional_parquet.alltypes) t > Add a new test that shows we only consider materialized agg exprs, somethin Done Line 476: # Optimization is not applied because we are not scanning a Parquet table. > Remove. This case is already covered above. Done http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test: Line 324: WARNING: The following tables are missing relevant table and/or column statistics. > Something wrong with your setup? This table should have stats in our dev se This has nothing to do with my setup. It passes a private build on Jenkins. I think this is here because the way the table is scanned is different. I'll investigate some more what's going on. http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-query/queries/QueryTest/aggregation.test File testdata/workloads/functional-query/queries/QueryTest/aggregation.test: Line 1259: # IMPALA-5036: Tests the correctness of the Parquet count(*) optimization. > Let's move these into a new .test file. Also no need to prefix IMPALA-5036 Done Line 1279: from functional_parquet.alltypes where month > 10 group by year, month > want to remove the predicate here (that case is explicitly called out below Done Line 1303: # IMPALA-5036: Parquet count(*) optimization with the result of the going into a join. > some extra words Done Line 1316: select 1 from functional_parquet.alltypes having count(*) > 1 > Add a count(*) test against an empty table and a table where we filtered al Done http://gerrit.cloudera.org:8080/#/c/6812/3/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: Line 275: exec_option = vector.get_value('exec_option') > Explain what this test is covering. Also would we get coverage of this auto Moved the test. Yes I did confirm that these files have multiple row groups. Added running the test on the tables you suggested. (Those also have many row groups). -- To view, visit http://gerrit.cloudera.org:8080/6812 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Zach Amsden <zams...@cloudera.com> Gerrit-HasComments: Yes