This is an automated email from the ASF dual-hosted git repository.

reshke pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry.git
commit c069dfbcb44be51d34f6f2afd5d051a11f05eaf8
Author: Maxim Smyatkin <[email protected]>
AuthorDate: Tue Feb 21 18:45:46 2023 +0300

    Force two-stage local aggregate to remove duplicates

    In a two-stage hash aggregate plan, when the local-stage aggregate's hash
    table fills up, it has one of two options: 1) spill to disk, or 2) stream
    to the next-stage aggregate. The issue is that if we stream a DQA to the
    next stage and the next stage is not prepared to handle duplicates, we can
    get wrong results. That was observed in the following plan:

    SELECT a, COUNT(DISTINCT b) FROM t GROUP BY a;
                                       QUERY PLAN
    --------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..3366.93 rows=7 width=12)
       ->  HashAggregate  (cost=0.00..3366.93 rows=3 width=12)
             Group Key: a
             ->  HashAggregate  (cost=0.00..3212.44 rows=1273152 width=8)
                   Group Key: a, b
                   ->  Seq Scan on test  (cost=0.00..640.00 rows=10000034 width=8)
     Optimizer: Pivotal Optimizer (GPORCA)
    (7 rows)

    The second HashAggregate used a hybrid hash table in streaming mode. That
    is, when the hash table enumerating all distinct pairs of a and b fills
    up, we start it over instead of spilling to disk. This produces
    duplicates, and the query returns extra tuples. The streaming hash table
    is used because the optimizer thinks it is okay to produce duplicates for
    this plan (<dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
    from the minidump), but it is not. The fix simply prohibits the aggregate
    node from producing duplicates, which resolves the problem.
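The failure mode described in the commit message can be sketched outside the database: a bounded local-stage hash table that restarts instead of spilling lets duplicate (a, b) pairs reach a next-stage aggregate that assumes its input is already distinct. This is an illustrative Python sketch with hypothetical names, not Cloudberry code:

```python
from collections import Counter

def local_stage(rows, capacity):
    """Local-stage dedup of (a, b) pairs with a bounded hash table.
    In streaming mode, a full table is cleared and reused instead of
    spilling, so a pair seen before the reset can be emitted again."""
    seen, out = set(), []
    for pair in rows:
        if pair not in seen:
            if len(seen) >= capacity:
                seen.clear()  # streaming restart instead of spill to disk
            seen.add(pair)
            out.append(pair)
    return out

def global_stage(pairs):
    """Next-stage aggregate: COUNT(DISTINCT b) per a, assuming its
    input already contains each (a, b) pair at most once."""
    return dict(Counter(a for a, _ in pairs))

rows = [(1, b % 3) for b in range(10)]  # only 3 distinct b values for a = 1
exact = global_stage(local_stage(rows, capacity=100))  # table never fills
streamed = global_stage(local_stage(rows, capacity=2))  # table restarts
# exact == {1: 3}; streamed over-counts because duplicates leak through
```

With a large enough table the result is the correct {1: 3}; once the table restarts, duplicate pairs reach the global stage and COUNT(DISTINCT b) is inflated, which is exactly why the fix forces the local aggregate to remove duplicates (StreamSafe="false").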
Authored-by: Smyatkin Maxim (with minor edits by David Kimura)
---
 .../gporca/data/dxl/minidump/AggregateWithSkew.mdp |  2 +-
 .../dxl/minidump/DQA-GroupBy-HashAggregate1.mdp    |  2 +-
 .../dxl/minidump/DQA-GroupBy-HashAggregate2.mdp    |  2 +-
 .../DQA-InnerJoin-GroupBy-HashAggregate.mdp        |  2 +-
 .../gporca/data/dxl/minidump/DqaNoRedistribute.mdp | 10 ++--
 ...tiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp |  2 +-
 .../dxl/minidump/ScalarDQAWithNonScalarAgg.mdp     |  2 +-
 .../minidump/ThreeStageAgg-DistinctOnDistrCol.mdp  | 14 ++---
 ...eeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp | 14 ++---
 .../ThreeStageAgg-GbandDistinctOnDistrCol.mdp      | 12 ++--
 .../gporca/libgpopt/src/xforms/CXformSplitDQA.cpp  |  6 +-
 src/test/isolation2/expected/spilling_hashagg.out  | 68 ++++++++++++++++++++++
 .../expected/spilling_hashagg_optimizer.out        | 68 ++++++++++++++++++++++
 src/test/isolation2/isolation2_schedule            |  2 +
 src/test/isolation2/sql/spilling_hashagg.sql       | 49 ++++++++++++++++
 src/test/regress/expected/gp_dqa_optimizer.out     |  4 +-
 16 files changed, 222 insertions(+), 37 deletions(-)

diff --git a/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp b/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp
index ed063dbeb8..833ea3c7fc 100644
--- a/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp
+++ b/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp
@@ -707,7 +707,7 @@ group by b;
           <dxl:Ident ColId="1" ColName="b" TypeMdid="0.23.1.0"/>
         </dxl:HashExpr>
       </dxl:HashExprList>
-      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
         <dxl:Properties>
           <dxl:Cost StartupCost="0" TotalCost="715.260000" Rows="2000000.000000" Width="8"/>
         </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp
index 67dee1e692..a6f524e37d 100644
--- a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp
+++ b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp
@@ -287,7 +287,7 @@
         </dxl:ProjElem>
       </dxl:ProjList>
       <dxl:Filter/>
-      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
        <dxl:Properties>
          <dxl:Cost StartupCost="0" TotalCost="431.000161" Rows="3.000000" Width="8"/>
        </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp
index 6b842edca9..fd155c0d76 100644
--- a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp
+++ b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp
@@ -363,7 +363,7 @@
           <dxl:Ident ColId="1" ColName="b" TypeMdid="0.23.1.0"/>
         </dxl:HashExpr>
       </dxl:HashExprList>
-      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
        <dxl:Properties>
          <dxl:Cost StartupCost="0" TotalCost="431.001880" Rows="20.000000" Width="8"/>
        </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp b/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp
index 79acfeb9cb..e7507cc47c 100644
--- a/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp
+++ b/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp
@@ -556,7 +556,7 @@
         </dxl:ProjElem>
       </dxl:ProjList>
       <dxl:Filter/>
-      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
        <dxl:Properties>
          <dxl:Cost StartupCost="0" TotalCost="862.003697" Rows="10.000001" Width="16"/>
        </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp b/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp
index 0697286479..afaf1f0bca 100644
--- a/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp
+++ b/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp
@@ -707,7 +707,7 @@
   <dxl:Plan Id="0" SpaceSize="18">
     <dxl:GatherMotion InputSegments="0,1,2" OutputSegments="-1">
       <dxl:Properties>
-        <dxl:Cost StartupCost="0" TotalCost="431.972943" Rows="101.000000" Width="12"/>
+        <dxl:Cost StartupCost="0" TotalCost="431.965888" Rows="101.000000" Width="12"/>
       </dxl:Properties>
       <dxl:ProjList>
         <dxl:ProjElem ColId="0" Alias="a">
@@ -721,7 +721,7 @@
       <dxl:SortingColumnList/>
       <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
         <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="431.968426" Rows="101.000000" Width="12"/>
+          <dxl:Cost StartupCost="0" TotalCost="431.961371" Rows="101.000000" Width="12"/>
         </dxl:Properties>
         <dxl:GroupingColumns>
           <dxl:GroupingColumn ColId="0"/>
@@ -744,7 +744,7 @@
         <dxl:Filter/>
         <dxl:RedistributeMotion InputSegments="0,1,2" OutputSegments="0,1,2">
           <dxl:Properties>
-            <dxl:Cost StartupCost="0" TotalCost="431.945220" Rows="568.125000" Width="8"/>
+            <dxl:Cost StartupCost="0" TotalCost="431.938165" Rows="568.125000" Width="8"/>
           </dxl:Properties>
           <dxl:ProjList>
             <dxl:ProjElem ColId="0" Alias="a">
@@ -761,9 +761,9 @@
               <dxl:Ident ColId="0" ColName="a" TypeMdid="0.23.1.0"/>
             </dxl:HashExpr>
           </dxl:HashExprList>
-          <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+          <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
             <dxl:Properties>
-              <dxl:Cost StartupCost="0" TotalCost="431.940478" Rows="568.125000" Width="8"/>
+              <dxl:Cost StartupCost="0" TotalCost="431.933423" Rows="568.125000" Width="8"/>
             </dxl:Properties>
             <dxl:GroupingColumns>
               <dxl:GroupingColumn ColId="0"/>
diff --git a/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp b/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp
index c75b21f9c4..7c7633fefd 100644
--- a/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp
+++ b/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp
@@ -503,7 +503,7 @@
         </dxl:ProjElem>
       </dxl:ProjList>
       <dxl:Filter/>
-      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
        <dxl:Properties>
          <dxl:Cost StartupCost="0" TotalCost="862.004184" Rows="10.000001" Width="20"/>
        </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp b/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp
index b2b82f9efc..bcfa49bb55 100644
--- a/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp
+++ b/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp
@@ -1496,7 +1496,7 @@
           <dxl:Ident ColId="2" ColName="c" TypeMdid="0.23.1.0"/>
         </dxl:HashExpr>
       </dxl:HashExprList>
-      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+      <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
        <dxl:Properties>
          <dxl:Cost StartupCost="0" TotalCost="441.310000" Rows="100000.000000" Width="8"/>
        </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp
index f04c7e2717..352ecaccf5 100644
--- a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp
+++ b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp
@@ -263,7 +263,7 @@
   <dxl:Plan Id="0" SpaceSize="38">
     <dxl:GatherMotion InputSegments="0,1" OutputSegments="-1">
       <dxl:Properties>
-        <dxl:Cost StartupCost="0" TotalCost="572.182073" Rows="4.000000" Width="8"/>
+        <dxl:Cost StartupCost="0" TotalCost="571.058696" Rows="4.000000" Width="8"/>
       </dxl:Properties>
       <dxl:ProjList>
         <dxl:ProjElem ColId="10" Alias="sum">
@@ -274,7 +274,7 @@
       <dxl:SortingColumnList/>
       <dxl:Result>
         <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="572.181930" Rows="4.000000" Width="8"/>
+          <dxl:Cost StartupCost="0" TotalCost="571.058552" Rows="4.000000" Width="8"/>
         </dxl:Properties>
         <dxl:ProjList>
           <dxl:ProjElem ColId="10" Alias="sum">
@@ -285,7 +285,7 @@
         <dxl:OneTimeFilter/>
         <dxl:Aggregate AggregationStrategy="Sorted" StreamSafe="false">
           <dxl:Properties>
-            <dxl:Cost StartupCost="0" TotalCost="572.181930" Rows="4.000000" Width="8"/>
+            <dxl:Cost StartupCost="0" TotalCost="571.058552" Rows="4.000000" Width="8"/>
           </dxl:Properties>
           <dxl:GroupingColumns>
             <dxl:GroupingColumn ColId="1"/>
@@ -308,7 +308,7 @@
           <dxl:Filter/>
           <dxl:Sort SortDiscardDuplicates="false">
             <dxl:Properties>
-              <dxl:Cost StartupCost="0" TotalCost="572.181903" Rows="4.500000" Width="8"/>
+              <dxl:Cost StartupCost="0" TotalCost="571.058525" Rows="4.500000" Width="8"/>
             </dxl:Properties>
             <dxl:ProjList>
               <dxl:ProjElem ColId="0" Alias="a">
@@ -326,7 +326,7 @@
             <dxl:LimitOffset/>
             <dxl:RedistributeMotion InputSegments="0,1" OutputSegments="0,1">
               <dxl:Properties>
-                <dxl:Cost StartupCost="0" TotalCost="572.181783" Rows="4.500000" Width="8"/>
+                <dxl:Cost StartupCost="0" TotalCost="571.058406" Rows="4.500000" Width="8"/>
               </dxl:Properties>
               <dxl:ProjList>
                 <dxl:ProjElem ColId="0" Alias="a">
@@ -343,9 +343,9 @@
                   <dxl:Ident ColId="1" ColName="b" TypeMdid="0.23.1.0"/>
                 </dxl:HashExpr>
               </dxl:HashExprList>
-              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
                 <dxl:Properties>
-                  <dxl:Cost StartupCost="0" TotalCost="572.181727" Rows="4.500000" Width="8"/>
+                  <dxl:Cost StartupCost="0" TotalCost="571.058350" Rows="4.500000" Width="8"/>
                 </dxl:Properties>
                 <dxl:GroupingColumns>
                   <dxl:GroupingColumn ColId="1"/>
diff --git a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp
index c41f9b099f..7980a2280b 100644
--- a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp
+++ b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp
@@ -264,7 +264,7 @@
   <dxl:Plan Id="0" SpaceSize="38">
     <dxl:GatherMotion InputSegments="0,1" OutputSegments="-1">
       <dxl:Properties>
-        <dxl:Cost StartupCost="0" TotalCost="637.664977" Rows="11.250000" Width="8"/>
+        <dxl:Cost StartupCost="0" TotalCost="635.979922" Rows="11.250000" Width="8"/>
       </dxl:Properties>
       <dxl:ProjList>
         <dxl:ProjElem ColId="10" Alias="sum">
@@ -275,7 +275,7 @@
       <dxl:SortingColumnList/>
      <dxl:Result>
        <dxl:Properties>
-         <dxl:Cost StartupCost="0" TotalCost="637.664573" Rows="11.250000" Width="8"/>
+         <dxl:Cost StartupCost="0" TotalCost="635.979518" Rows="11.250000" Width="8"/>
        </dxl:Properties>
        <dxl:ProjList>
          <dxl:ProjElem ColId="10" Alias="sum">
@@ -286,7 +286,7 @@
        <dxl:OneTimeFilter/>
        <dxl:Aggregate AggregationStrategy="Sorted" StreamSafe="false">
          <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="637.664573" Rows="11.250000" Width="8"/>
+          <dxl:Cost StartupCost="0" TotalCost="635.979518" Rows="11.250000" Width="8"/>
          </dxl:Properties>
          <dxl:GroupingColumns>
            <dxl:GroupingColumn ColId="1"/>
@@ -313,7 +313,7 @@
          <dxl:Filter/>
          <dxl:Sort SortDiscardDuplicates="false">
            <dxl:Properties>
-             <dxl:Cost StartupCost="0" TotalCost="637.664480" Rows="11.250000" Width="12"/>
+             <dxl:Cost StartupCost="0" TotalCost="635.979425" Rows="11.250000" Width="12"/>
            </dxl:Properties>
            <dxl:ProjList>
              <dxl:ProjElem ColId="0" Alias="a">
@@ -335,7 +335,7 @@
            <dxl:LimitOffset/>
            <dxl:RedistributeMotion InputSegments="0,1" OutputSegments="0,1">
              <dxl:Properties>
-               <dxl:Cost StartupCost="0" TotalCost="637.663526" Rows="11.250000" Width="12"/>
+               <dxl:Cost StartupCost="0" TotalCost="635.978472" Rows="11.250000" Width="12"/>
              </dxl:Properties>
              <dxl:ProjList>
                <dxl:ProjElem ColId="0" Alias="a">
@@ -358,9 +358,9 @@
                  <dxl:Ident ColId="2" ColName="c" TypeMdid="0.23.1.0"/>
                </dxl:HashExpr>
              </dxl:HashExprList>
-              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
                <dxl:Properties>
-                 <dxl:Cost StartupCost="0" TotalCost="637.663315" Rows="11.250000" Width="12"/>
+                 <dxl:Cost StartupCost="0" TotalCost="635.978260" Rows="11.250000" Width="12"/>
                </dxl:Properties>
                <dxl:GroupingColumns>
                  <dxl:GroupingColumn ColId="1"/>
diff --git a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp
index a41741dae7..d4ff364734 100644
--- a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp
+++ b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp
@@ -263,7 +263,7 @@
   <dxl:Plan Id="0" SpaceSize="11">
     <dxl:GatherMotion InputSegments="0,1" OutputSegments="-1">
       <dxl:Properties>
-        <dxl:Cost StartupCost="0" TotalCost="507.148781" Rows="2.000000" Width="8"/>
+        <dxl:Cost StartupCost="0" TotalCost="506.587091" Rows="2.000000" Width="8"/>
       </dxl:Properties>
       <dxl:ProjList>
         <dxl:ProjElem ColId="10" Alias="sum">
@@ -274,7 +274,7 @@
       <dxl:SortingColumnList/>
       <dxl:Result>
         <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="507.148709" Rows="2.000000" Width="8"/>
+          <dxl:Cost StartupCost="0" TotalCost="506.587019" Rows="2.000000" Width="8"/>
         </dxl:Properties>
         <dxl:ProjList>
           <dxl:ProjElem ColId="10" Alias="sum">
@@ -285,7 +285,7 @@
         <dxl:OneTimeFilter/>
         <dxl:Aggregate AggregationStrategy="Sorted" StreamSafe="false">
           <dxl:Properties>
-            <dxl:Cost StartupCost="0" TotalCost="507.148709" Rows="2.000000" Width="8"/>
+            <dxl:Cost StartupCost="0" TotalCost="506.587019" Rows="2.000000" Width="8"/>
           </dxl:Properties>
           <dxl:GroupingColumns>
             <dxl:GroupingColumn ColId="0"/>
@@ -308,7 +308,7 @@
           <dxl:Filter/>
           <dxl:Sort SortDiscardDuplicates="false">
             <dxl:Properties>
-              <dxl:Cost StartupCost="0" TotalCost="507.148701" Rows="2.000000" Width="4"/>
+              <dxl:Cost StartupCost="0" TotalCost="506.587011" Rows="2.000000" Width="4"/>
             </dxl:Properties>
             <dxl:ProjList>
               <dxl:ProjElem ColId="0" Alias="a">
@@ -321,9 +321,9 @@
             </dxl:SortingColumnList>
             <dxl:LimitCount/>
             <dxl:LimitOffset/>
-            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
              <dxl:Properties>
-               <dxl:Cost StartupCost="0" TotalCost="507.148701" Rows="2.000000" Width="4"/>
+               <dxl:Cost StartupCost="0" TotalCost="506.587011" Rows="2.000000" Width="4"/>
              </dxl:Properties>
              <dxl:GroupingColumns>
                <dxl:GroupingColumn ColId="0"/>
diff --git a/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp b/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp
index dbc325a33c..3567968736 100644
--- a/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp
+++ b/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp
@@ -738,13 +738,11 @@ CXformSplitDQA::PexprMultiLevelAggregation(
 	CExpressionArray *pdrgpexprLastStage = pdrgpexprPrElSecondStage;
 	if (fSplit2LevelsOnly)
 	{
-		// for scalar DQA the local aggregate is responsible for removing duplicates
-		BOOL fLocalAggGeneratesDuplicates = (0 < pdrgpcrLastStage->Size());
-
+		// the local aggregate is responsible for removing duplicates
 		pdrgpcrArgDQA->AddRef();
 		popFirstStage = GPOS_NEW(mp) CLogicalGbAgg(
 			mp, pdrgpcrLocal, COperator::EgbaggtypeLocal,
-			fLocalAggGeneratesDuplicates, pdrgpcrArgDQA, aggStage);
+			false /* fGeneratesDuplicates */, pdrgpcrArgDQA, aggStage);
 		pdrgpcrLastStage->AddRef();
 		popSecondStage = GPOS_NEW(mp) CLogicalGbAgg(
 			mp, pdrgpcrLastStage, COperator::EgbaggtypeGlobal, /* egbaggtype */
diff --git a/src/test/isolation2/expected/spilling_hashagg.out b/src/test/isolation2/expected/spilling_hashagg.out
new file mode 100644
index 0000000000..8f01e1e241
--- /dev/null
+++ b/src/test/isolation2/expected/spilling_hashagg.out
@@ -0,0 +1,68 @@
+-- start_ignore
+-- end_ignore
+
+-- Test Orca properly removes duplicates in DQA
+-- (https://github.com/greenplum-db/gpdb/issues/14993)
+
+-- GPDB_12_MERGE_FEATURE_NOT_SUPPORTED: After streaming hash aggregates are
+-- supported then add a fault injection for 'force_hashagg_stream_hashtable'.
+-- Until then this test doesn't actually test spilling.
+
+CREATE TABLE test_src_tbl AS WITH cte1 AS ( SELECT field5 from generate_series(1,1000) field5 ) SELECT field5 % 100 AS a, field5 % 100 + 1 AS b FROM cte1 DISTRIBUTED BY (a);
+CREATE 1000
+ANALYZE test_src_tbl;
+ANALYZE
+
+
+-- Use isolation2 framework to force a streaming hash aggregate to clear the
+-- hash table and stream tuples to next stage aggregate. This is to simulate
+-- hash table spills after 100 tuples inserted any segment.
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'skip', '', '', '', 100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+ gp_inject_fault
+-----------------
+ Success:
+ Success:
+ Success:
+ Success:
+(4 rows)
+CREATE TABLE test_hashagg_on AS SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+                 QUERY PLAN
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)
+   ->  HashAggregate
+         Group Key: a
+         ->  HashAggregate
+               Group Key: a, b
+               ->  Seq Scan on test_src_tbl
+ Optimizer: Postgres query optimizer
+(7 rows)
+
+-- Compare results against a group aggregate plan.
+set optimizer_enable_hashagg=off;
+SET
+CREATE TABLE test_hashagg_off AS SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+                 QUERY PLAN
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)
+   ->  HashAggregate
+         Group Key: a
+         ->  HashAggregate
+               Group Key: a, b
+               ->  Seq Scan on test_src_tbl
+ Optimizer: Postgres query optimizer
+(7 rows)
+
+-- Results should match
+SELECT (n_total=n_matches) AS match FROM ( SELECT COUNT(*) n_total, SUM(CASE WHEN t1.b = t2.b THEN 1 ELSE 0 END) n_matches FROM test_hashagg_on t1 JOIN test_hashagg_off t2 ON t1.a = t2.a) t;
+ match
+-------
+ t
+(1 row)
+
+
+-- start_ignore
+-- end_ignore
diff --git a/src/test/isolation2/expected/spilling_hashagg_optimizer.out b/src/test/isolation2/expected/spilling_hashagg_optimizer.out
new file mode 100644
index 0000000000..8f0f54e200
--- /dev/null
+++ b/src/test/isolation2/expected/spilling_hashagg_optimizer.out
@@ -0,0 +1,68 @@
+-- start_ignore
+-- end_ignore
+
+-- Test Orca properly removes duplicates in DQA
+-- (https://github.com/greenplum-db/gpdb/issues/14993)
+
+-- GPDB_12_MERGE_FEATURE_NOT_SUPPORTED: After streaming hash aggregates are
+-- supported then add a fault injection for 'force_hashagg_stream_hashtable'.
+-- Until then this test doesn't actually test spilling.
+
+CREATE TABLE test_src_tbl AS WITH cte1 AS ( SELECT field5 from generate_series(1,1000) field5 ) SELECT field5 % 100 AS a, field5 % 100 + 1 AS b FROM cte1 DISTRIBUTED BY (a);
+CREATE 1000
+ANALYZE test_src_tbl;
+ANALYZE
+
+
+-- Use isolation2 framework to force a streaming hash aggregate to clear the
+-- hash table and stream tuples to next stage aggregate. This is to simulate
+-- hash table spills after 100 tuples inserted any segment.
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'skip', '', '', '', 100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+ gp_inject_fault
+-----------------
+ Success:
+ Success:
+ Success:
+ Success:
+(4 rows)
+CREATE TABLE test_hashagg_on AS SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+                 QUERY PLAN
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)
+   ->  HashAggregate
+         Group Key: a
+         ->  HashAggregate
+               Group Key: a, b
+               ->  Seq Scan on test_src_tbl
+ Optimizer: Pivotal Optimizer (GPORCA)
+(7 rows)
+
+-- Compare results against a group aggregate plan.
+set optimizer_enable_hashagg=off;
+SET
+CREATE TABLE test_hashagg_off AS SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+                 QUERY PLAN
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)
+   ->  GroupAggregate
+         Group Key: a
+         ->  Sort
+               Sort Key: a
+               ->  Seq Scan on test_src_tbl
+ Optimizer: Pivotal Optimizer (GPORCA)
+(7 rows)
+
+-- Results should match
+SELECT (n_total=n_matches) AS match FROM ( SELECT COUNT(*) n_total, SUM(CASE WHEN t1.b = t2.b THEN 1 ELSE 0 END) n_matches FROM test_hashagg_on t1 JOIN test_hashagg_off t2 ON t1.a = t2.a) t;
+ match
+-------
+ t
+(1 row)
+
+
+-- start_ignore
+-- end_ignore
diff --git a/src/test/isolation2/isolation2_schedule b/src/test/isolation2/isolation2_schedule
index 588fd326e5..5b24284ab6 100644
--- a/src/test/isolation2/isolation2_schedule
+++ b/src/test/isolation2/isolation2_schedule
@@ -311,6 +311,8 @@ test: aocs_unique_index
 test: uao/ao_unique_index_vacuum_row
 test: uao/ao_unique_index_vacuum_column
 
+test: spilling_hashagg
+
 test: local_directory_table_mixed
 test: stat_activity_extended
 test: sharedsnapshot
diff --git a/src/test/isolation2/sql/spilling_hashagg.sql b/src/test/isolation2/sql/spilling_hashagg.sql
new file mode 100644
index 0000000000..eaa1564964
--- /dev/null
+++ b/src/test/isolation2/sql/spilling_hashagg.sql
@@ -0,0 +1,49 @@
+-- start_ignore
+CREATE EXTENSION IF NOT EXISTS gp_inject_fault;
+DROP TABLE IF EXISTS test_src_tbl;
+DROP TABLE IF EXISTS test_hashagg_on;
+DROP TABLE IF EXISTS test_hashagg_off;
+-- end_ignore
+
+-- Test Orca properly removes duplicates in DQA
+-- (https://github.com/greenplum-db/gpdb/issues/14993)
+
+-- GPDB_12_MERGE_FEATURE_NOT_SUPPORTED: After streaming hash aggregates are
+-- supported then add a fault injection for 'force_hashagg_stream_hashtable'.
+-- Until then this test doesn't actually test spilling.
+
+CREATE TABLE test_src_tbl AS
+WITH cte1 AS (
+    SELECT field5 from generate_series(1,1000) field5
+)
+SELECT field5 % 100 AS a, field5 % 100 + 1 AS b
+FROM cte1 DISTRIBUTED BY (a);
+ANALYZE test_src_tbl;
+
+
+-- Use isolation2 framework to force a streaming hash aggregate to clear the
+-- hash table and stream tuples to next stage aggregate. This is to simulate
+-- hash table spills after 100 tuples inserted any segment.
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'skip', '', '', '', 100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+CREATE TABLE test_hashagg_on AS
+SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+
+-- Compare results against a group aggregate plan.
+set optimizer_enable_hashagg=off;
+CREATE TABLE test_hashagg_off AS
+SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+
+-- Results should match
+SELECT (n_total=n_matches) AS match FROM (
+SELECT COUNT(*) n_total, SUM(CASE WHEN t1.b = t2.b THEN 1 ELSE 0 END) n_matches
+FROM test_hashagg_on t1
+JOIN test_hashagg_off t2 ON t1.a = t2.a) t;
+
+
+-- start_ignore
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'status', '', '', '', 100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'reset', '', '', '', 100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+RESET ALL;
+-- end_ignore
diff --git a/src/test/regress/expected/gp_dqa_optimizer.out b/src/test/regress/expected/gp_dqa_optimizer.out
index b73c793c1d..71ad1cde46 100644
--- a/src/test/regress/expected/gp_dqa_optimizer.out
+++ b/src/test/regress/expected/gp_dqa_optimizer.out
@@ -2841,7 +2841,7 @@ explain (verbose on, costs off)select sum(Distinct a), count(b), sum(c) from dqa
    ->  Redistribute Motion 3:3  (slice2; segments: 3)
          Output: a, e, (PARTIAL count(b)), (PARTIAL sum(c))
          Hash Key: e
-         ->  Streaming Partial HashAggregate
+         ->  Partial HashAggregate
                Output: a, e, PARTIAL count(b), PARTIAL sum(c)
                Group Key: dqa_f3.e, dqa_f3.a
                ->  Seq Scan on public.dqa_f3
@@ -2870,7 +2870,7 @@ explain (verbose on, costs off) select sum(Distinct e), count(b), sum(c) from dq
    ->  Finalize HashAggregate
          Output: sum(e), count(b), sum(c)
          Group Key: dqa_f3.a
-         ->  Streaming Partial HashAggregate
+         ->  Partial HashAggregate
                Output: PARTIAL count(b), PARTIAL sum(c), a, e
                Group Key: dqa_f3.a, dqa_f3.e
                ->  Seq Scan on public.dqa_f3

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
