[Impala-ASF-CR] IMPALA-5036: Parquet count star optimization

Alex Behm (Code Review) Mon, 12 Jun 2017 13:31:14 -0700

Alex Behm has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................



Patch Set 3:

(26 comments)

http://gerrit.cloudera.org:8080/#/c/6812/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 452:   }
Why not else if as in the previous patch set? Else-if seems more accurate.


http://gerrit.cloudera.org:8080/#/c/6812/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 226:   11: optional i64 parquet_count_star_slot_offset
i32 right?


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 571:   public FunctionCallExpr getMergeAggInputFn() { return 
mergeAggInputFn_; }
> Yes, they are used on line 617, for example. I don't think it would be a go
I think you can access members directly in L617


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 248:    * Adds a new slot descriptor to the tuple descriptor of this scan. 
Also adds an entry
* explain what is going to be stored in this new slot descriptor
* explain what is returned


Line 249:    * to 'optimizedAggSmap_' that replaces a count() with a special 
sum() function that
that substitutes count(*) with sum_init_zero(<new-slotref>)


Line 915:     
msg.hdfs_scan_node.setOptimize_parquet_count_star(optimizedAggSmap_ != null);
Do we need to pass this to the BE? The presence/absence of the 
parquet_count_star_slot_offset seems enough to decide.


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

Line 1213:    * table scans.
instead of scanning the table (fix other places below also)


http://gerrit.cloudera.org:8080/#/c/6812/3/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

Line 290:   public void testParquetStats() { 
runPlannerTestFile("parquet-stats-agg"); }
testParquetStatsAgg()


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 1: # Verify that that the parquet count(*) optimization is applied in all 
the cases.
spell out "in all the cases" a little more and also mention that in one case we 
expect the optimization to not be applied


Line 22: |  |  output: sum_init_zero(functional_parquet.alltypes.parquet-stats: 
num_rows)
Can we reduce this to just parquet-stats.num_rows? How do we create such a long 
label?


Line 99: ---- DISTRIBUTEDPLAN
Remove here and all tests below. I think showing the distributed plan for the 
first test case is enough.


Line 114: select month, count(*) from functional_parquet.alltypes group by 
month, year
Add a negative test for this one:

select count(*), count(year), count(month),


Line 172: select max(year), count(*) from functional_parquet.alltypes
use avg() instead of max() because max() is going to be optimized in the same 
way in the future (but avg() definitely not)


Line 195: # IMPALA-5036
JIRA number is not very descriptive. Describe what this test case is checking.


Line 278: # The count(*) optimization is applied to the inline view even if 
there is a join.
Add a negative test case that shows the query block must have one table ref, 
e.g.

select count(*) from functional_parquet.alltypes a, functional_parquet.alltypes 
b


Line 352: # tinyint_col is not partitioned so the optimization is disabled.
tinyint_col is not a partition column


Line 402: # Optimization is not applied in the case of count(null).
is not applied to count(null)


Line 451: # Optimization is not applied because the count(*) is not applied 
directly to the
Optimization is not applied across query blocks, even though it would be 
correct here.


Line 453: select count(*) from ( select int_col from 
functional_parquet.alltypes) t
Add a new test that shows we only consider materialized agg exprs, something 
like:

select year, cnt from (
  select year, count(bigint_col), count(*) cnt, avg(int_col)
  from functional_parquet.alltypes
  where month=1
  group by year
)


Line 476: # Optimization is not applied because we are not scanning a Parquet 
table.
Remove. This case is already covered above.


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

Line 324: WARNING: The following tables are missing relevant table and/or 
column statistics.
Something wrong with your setup? This table should have stats in our dev setup.


http://gerrit.cloudera.org:8080/#/c/6812/3/testdata/workloads/functional-query/queries/QueryTest/aggregation.test
File testdata/workloads/functional-query/queries/QueryTest/aggregation.test:

Line 1259: # IMPALA-5036: Tests the correctness of the Parquet count(*) 
optimization.
Let's move these into a new .test file. Also no need to prefix IMPALA-5036 
everywhere (that's mostly for regression tests, but this is a new feature)


Line 1279: from functional_parquet.alltypes where month > 10 group by year, 
month
want to remove the predicate here (that case is explicitly called out below)


Line 1303: # IMPALA-5036: Parquet count(*) optimization with the result of the 
going into a join.
some extra words


Line 1316: select 1 from functional_parquet.alltypes having count(*) > 1
Add a count(*) test against an empty table and a table where we filtered all 
partition columns to make sure that 0 is returned.


http://gerrit.cloudera.org:8080/#/c/6812/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

Line 275:     exec_option = vector.get_value('exec_option')
Explain what this test is covering. Also would we get coverage of this 
automatically if we moved this test into the existing .test file (where your 
other tests live)? In exhaustive I believe we do run with different batch sizes 
including 1.

Did you confirm that these files have multiple row groups?

Might also be good to run a count(*) on these tables:
functional_parquet.lineitem_multiblock_one_row_group
functional_parquet.lineitem_sixblocks
functional_parquet.lineitem_multiblock

(You can take a why other tests use them)


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5036: Parquet count star optimization

Reply via email to