Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-5036: Parquet count star optimization ......................................................................
Patch Set 5: (6 comments) http://gerrit.cloudera.org:8080/#/c/6812/5/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: Line 886: prefix + "8InitZeroIN10impala_udf9BigIntValEEEvPNS2_15FunctionContextEPT_", > rather than defining a new builtin, why don't we wrap the sum() with a coal I gave this a try, but ran into a problem. I tried substituting a count(*) with a zeroifnull(sum(parquet_num_rows)). The aggregation node can only evaluate agg functions, and zeroifnull is not. A potential way out is to modify the merge function so that the input to it is wrapped in a zeroifnull(), but I think that's complicated. We would have make modifications in 2 places if we go that route: replace count() with sum() and wrap the input to the merge function. Simply adding a builtin seems cleaner to me. http://gerrit.cloudera.org:8080/#/c/6812/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: PS5, Line 105: The caller passes in a flag to the constructor that indicates if the count(*) : * optimization can be applied to the query block of this scan. This scan node decides : * whether to apply the optimization or not > This doesn't make a lot of sense to me. On one hand, it sayst he caller dec Reworded slightly. (The main idea is that they are both involved in the decision). Line 109: * > do we actually explain in some comment how this optimization works at a hig This is the only one (I think it's pretty high level). PS5, Line 140: This : // scan does additional analysis in init() to determine whether it is correct to apply : // the optimization. > this seems to contradict the comment above that says the caller passes a fl The optimization is applied if 2 things are true: the query block meets certain requirements and the scan node meets certain requirements. This flag indicates if the query block meets the requirements. http://gerrit.cloudera.org:8080/#/c/6812/5/testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test File testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test: Line 86: # Verify that 0 is returned when we are selecting from an empty table and the optimization > I think it's sufficient to say: Done http://gerrit.cloudera.org:8080/#/c/6812/4/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: Line 280: vector.get_value('exec_option')['batch_size'] = 1 > The core tests we run and the dimensions we use seem somewhat broken/scary Created IMPALA-5603. -- To view, visit http://gerrit.cloudera.org:8080/6812 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-Reviewer: Zach Amsden <[email protected]> Gerrit-HasComments: Yes
