[Impala-ASF-CR] IMPALA-5036: Parquet count star optimization

Taras Bobrovytsky (Code Review) Wed, 14 Jun 2017 18:02:19 -0700

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................



Patch Set 3:

(28 comments)

http://gerrit.cloudera.org:8080/#/c/6812/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 445:       *dst_slot = file_metadata_.row_groups[row_group_idx_].num_rows;
> Bounds check against file_metadata_.num_rows (i.e. keep a running counter a
Done
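
For readers following the thread, here is a minimal sketch of one reading of
the running-counter check. It is written in Python rather than the scanner's
C++, and the RowGroup and FileMetadata classes and the function name are
hypothetical stand-ins, not the actual Impala types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    num_rows: int  # row count recorded in this row group's metadata


@dataclass
class FileMetadata:
    num_rows: int  # file-level row count from the footer
    row_groups: List[RowGroup]


def row_counts_from_metadata(meta: FileMetadata) -> List[int]:
    """Emit one count per row group, bounds-checked against the file total.

    Assumption: the check rejects a file whose row groups claim more rows
    in total than the file-level num_rows.
    """
    emitted = []
    running_total = 0
    for idx, row_group in enumerate(meta.row_groups):
        running_total += row_group.num_rows
        if running_total > meta.num_rows:
            raise ValueError(
                f"row group {idx}: running total {running_total} exceeds "
                f"file num_rows {meta.num_rows}")
        emitted.append(row_group.num_rows)
    return emitted
```

Each emitted value corresponds to what the scanner writes into the slot for
one row group; the counts are summed later by the aggregation.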


Line 452:   }
> Why not else if as in the previous patch set? Else-if seems more accurate.
Reverted to else if. (I don't think it matters whether we use else if or not; 
the behavior is identical in both cases.)


Line 454:   if (scan_node_->IsZeroSlotTableScan()) {
> Why is this optimization not redundant now?  Maybe update the comment to in
Done


http://gerrit.cloudera.org:8080/#/c/6812/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 226:   11: optional i64 parquet_count_star_slot_offset
> Would it be simpler to have this be one parameter and indicate truth by pas
Yes, I did something similar. (It's now true if this parameter is set.)


Line 226:   11: optional i64 parquet_count_star_slot_offset
> i32 right?
Ah yes, because it's int instead of long in Java. Done
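
A small sketch of the "presence means enabled" convention discussed for this
field, in Python for brevity. The ScanNodeParams class is a hypothetical
stand-in for the Thrift struct; only the field name comes from the patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanNodeParams:
    # Hypothetical stand-in for the optional Thrift field. None means the
    # field was not set by the frontend.
    parquet_count_star_slot_offset: Optional[int] = None


def count_star_optimization_enabled(params: ScanNodeParams) -> bool:
    # Compare against None explicitly: an offset of 0 is a valid value,
    # so plain truthiness would wrongly report the optimization as off.
    return params.parquet_count_star_slot_offset is not None
```

With this convention a separate boolean flag is not needed, since the
optional offset carries both pieces of information.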


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 248:    * Adds a new slot descriptor to the tuple descriptor of this scan. Also adds an entry
> * explain what is going to be stored in this new slot descriptor
Done


Line 249:    * to 'optimizedAggSmap_' that replaces a count() with a special sum() function that
> that substitutes count(*) with sum_init_zero(<new-slotref>)
Done
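
To illustrate why count(*) is replaced with a sum that starts at zero rather
than a plain sum(), here is a hedged Python sketch. It assumes, as the name
suggests, that sum_init_zero differs from sum() only in its initial value;
both functions below are illustrative models, not Impala's implementation.

```python
from typing import Iterable, Optional


def sum_nullable(values: Iterable[int]) -> Optional[int]:
    """Model of SQL sum(): returns NULL (None) when there are no inputs."""
    total = None
    for value in values:
        total = value if total is None else total + value
    return total


def sum_init_zero(values: Iterable[int]) -> int:
    """Model of a sum that starts at 0, matching count(*) on empty input."""
    total = 0
    for value in values:
        total += value
    return total


# Per-row-group num_rows values, as the scan would produce them.
row_group_counts = [2, 3, 5]
total_rows = sum_init_zero(row_group_counts)
```

On an empty input count(*) must return 0, which the zero-initialized sum
does and the nullable sum does not.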


Line 915:     msg.hdfs_scan_node.setOptimize_parquet_count_star(optimizedAggSmap_ != null);
> Do we need to pass this to the BE? The presence/absence of the parquet_coun
Done


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

Line 1213:    * table scans.
> instead of scanning the table (fix other places below also)
Done


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

Line 290:   public void testParquetStats() { runPlannerTestFile("parquet-stats-agg"); }
> testParquetStatsAgg()
Done


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 1: # Verify that that the parquet count(*) optimization is applied in all 
the cases.
> spell out "in all the cases" a little more and also mention that in one cas
Done


Line 22: |  |  output: sum_init_zero(functional_parquet.alltypes.parquet-stats: num_rows)
> Can we reduce this to just parquet-stats.num_rows? How do we create such a 
What gets printed here is the slot descriptor label, which is set on line 263 
of HdfsScanNode.java. The full path is printed by default. Are you suggesting 
adding some extra plumbing to control how labels get printed?


Line 99: ---- DISTRIBUTEDPLAN
> Remove here and all tests below. I think showing the distributed plan for t
Done


Line 114: select month, count(*) from functional_parquet.alltypes group by month, year
> Add a negative test for this one:
Added a select count(year) from alltypes.


Line 172: select max(year), count(*) from functional_parquet.alltypes
> use avg() instead of max() because max() is going to be optimized in the sa
Done


Line 195: # IMPALA-5036
> JIRA number is not very descriptive. Describe what this test case is checki
Rewrote. Still feels like the description is not quite right.


Line 278: # The count(*) optimization is applied to the inline view even if there is a join.
> Add a negative test case that shows the query block must have one table ref
Done


Line 352: # tinyint_col is not partitioned so the optimization is disabled.
> tinyint_col is not a partition column
Done


Line 402: # Optimization is not applied in the case of count(null).
> is not applied to count(null)
Done


Line 451: # Optimization is not applied because the count(*) is not applied directly to the
> Optimization is not applied across query blocks, even though it would be co
Done


Line 453: select count(*) from ( select int_col from functional_parquet.alltypes) t
> Add a new test that shows we only consider materialized agg exprs, somethin
Done


Line 476: # Optimization is not applied because we are not scanning a Parquet table.
> Remove. This case is already covered above.
Done


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

Line 324: WARNING: The following tables are missing relevant table and/or column statistics.
> Something wrong with your setup? This table should have stats in our dev se
This has nothing to do with my setup; it passes a private build on Jenkins. I 
think the warning appears because the table is now scanned differently. I'll 
investigate further.


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-query/queries/QueryTest/aggregation.test
File testdata/workloads/functional-query/queries/QueryTest/aggregation.test:

Line 1259: # IMPALA-5036: Tests the correctness of the Parquet count(*) 
optimization.
> Let's move these into a new .test file. Also no need to prefix IMPALA-5036 
Done


Line 1279: from functional_parquet.alltypes where month > 10 group by year, month
> want to remove the predicate here (that case is explicitly called out below
Done


Line 1303: # IMPALA-5036: Parquet count(*) optimization with the result of the 
going into a join.
> some extra words
Done


Line 1316: select 1 from functional_parquet.alltypes having count(*) > 1
> Add a count(*) test against an empty table and a table where we filtered al
Done


http://gerrit.cloudera.org:8080/#/c/6812/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

Line 275:     exec_option = vector.get_value('exec_option')
> Explain what this test is covering. Also would we get coverage of this auto
Moved the test.

Yes, I did confirm that these files have multiple row groups. I also added 
runs of the test on the tables you suggested (those also have many row groups).


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <zams...@cloudera.com>
Gerrit-HasComments: Yes
