This is an automated email from the ASF dual-hosted git repository.

reshke pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry.git

commit c069dfbcb44be51d34f6f2afd5d051a11f05eaf8
Author: Maxim Smyatkin <[email protected]>
AuthorDate: Tue Feb 21 18:45:46 2023 +0300

    Force two-stage local aggregate to remove duplicates
    
    In a two-stage hash aggregate plan, if the local stage aggregate hash
    table fills up then it has one of two options:
    
     1) spill to disk
     2) stream to next stage aggregate.
    
    The issue is that if we stream a DQA to the next stage and the next stage
    isn't prepared to handle duplicates, then we could encounter wrong results.
    That was observed in the following plan:
    
        SELECT a, COUNT(DISTINCT b) FROM t GROUP BY a;
    
                                           QUERY PLAN
        
--------------------------------------------------------------------------------
         Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..3366.93 rows=7 
width=12)
           ->  HashAggregate  (cost=0.00..3366.93 rows=3 width=12)
                 Group Key: a
                 ->  HashAggregate  (cost=0.00..3212.44 rows=1273152 width=8)
                       Group Key: a, b
                       ->  Seq Scan on test  (cost=0.00..640.00 rows=10000034 
width=8)
         Optimizer: Pivotal Optimizer (GPORCA)
        (7 rows)
    
    The second HashAggregate used hybrid hashtable in streaming mode. That
    is, when the hashtable enumerating all distinct pairs of A and B fills
    up, we start it over instead of spilling to disk. This results in
    duplicates and we get more tuples as a result of this query.
    
    It seems that we are using a streaming hashtable because the optimizer
    thinks it is okay to produce duplicates for this plan's HashAggregate
    (<dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true"> from
    minidump), but it is not. The fix here simply prohibits the aggregate
    node from producing duplicates, and it resolves the problem.
    
    Authored-by: Smyatkin Maxim (with minor edits by David Kimura)
---
 .../gporca/data/dxl/minidump/AggregateWithSkew.mdp |  2 +-
 .../dxl/minidump/DQA-GroupBy-HashAggregate1.mdp    |  2 +-
 .../dxl/minidump/DQA-GroupBy-HashAggregate2.mdp    |  2 +-
 .../DQA-InnerJoin-GroupBy-HashAggregate.mdp        |  2 +-
 .../gporca/data/dxl/minidump/DqaNoRedistribute.mdp | 10 ++--
 ...tiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp |  2 +-
 .../dxl/minidump/ScalarDQAWithNonScalarAgg.mdp     |  2 +-
 .../minidump/ThreeStageAgg-DistinctOnDistrCol.mdp  | 14 ++---
 ...eeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp | 14 ++---
 .../ThreeStageAgg-GbandDistinctOnDistrCol.mdp      | 12 ++--
 .../gporca/libgpopt/src/xforms/CXformSplitDQA.cpp  |  6 +-
 src/test/isolation2/expected/spilling_hashagg.out  | 68 ++++++++++++++++++++++
 .../expected/spilling_hashagg_optimizer.out        | 68 ++++++++++++++++++++++
 src/test/isolation2/isolation2_schedule            |  2 +
 src/test/isolation2/sql/spilling_hashagg.sql       | 49 ++++++++++++++++
 src/test/regress/expected/gp_dqa_optimizer.out     |  4 +-
 16 files changed, 222 insertions(+), 37 deletions(-)

diff --git a/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp 
b/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp
index ed063dbeb8..833ea3c7fc 100644
--- a/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp
+++ b/src/backend/gporca/data/dxl/minidump/AggregateWithSkew.mdp
@@ -707,7 +707,7 @@ group by b;
                 <dxl:Ident ColId="1" ColName="b" TypeMdid="0.23.1.0"/>
               </dxl:HashExpr>
             </dxl:HashExprList>
-            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
               <dxl:Properties>
                 <dxl:Cost StartupCost="0" TotalCost="715.260000" 
Rows="2000000.000000" Width="8"/>
               </dxl:Properties>
diff --git 
a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp 
b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp
index 67dee1e692..a6f524e37d 100644
--- a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp
+++ b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate1.mdp
@@ -287,7 +287,7 @@
               </dxl:ProjElem>
             </dxl:ProjList>
             <dxl:Filter/>
-            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
               <dxl:Properties>
                 <dxl:Cost StartupCost="0" TotalCost="431.000161" 
Rows="3.000000" Width="8"/>
               </dxl:Properties>
diff --git 
a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp 
b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp
index 6b842edca9..fd155c0d76 100644
--- a/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp
+++ b/src/backend/gporca/data/dxl/minidump/DQA-GroupBy-HashAggregate2.mdp
@@ -363,7 +363,7 @@
                   <dxl:Ident ColId="1" ColName="b" TypeMdid="0.23.1.0"/>
                 </dxl:HashExpr>
               </dxl:HashExprList>
-              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
                 <dxl:Properties>
                   <dxl:Cost StartupCost="0" TotalCost="431.001880" 
Rows="20.000000" Width="8"/>
                 </dxl:Properties>
diff --git 
a/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp 
b/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp
index 79acfeb9cb..e7507cc47c 100644
--- 
a/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp
+++ 
b/src/backend/gporca/data/dxl/minidump/DQA-InnerJoin-GroupBy-HashAggregate.mdp
@@ -556,7 +556,7 @@
               </dxl:ProjElem>
             </dxl:ProjList>
             <dxl:Filter/>
-            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
               <dxl:Properties>
                 <dxl:Cost StartupCost="0" TotalCost="862.003697" 
Rows="10.000001" Width="16"/>
               </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp 
b/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp
index 0697286479..afaf1f0bca 100644
--- a/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp
+++ b/src/backend/gporca/data/dxl/minidump/DqaNoRedistribute.mdp
@@ -707,7 +707,7 @@
     <dxl:Plan Id="0" SpaceSize="18">
       <dxl:GatherMotion InputSegments="0,1,2" OutputSegments="-1">
         <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="431.972943" Rows="101.000000" 
Width="12"/>
+          <dxl:Cost StartupCost="0" TotalCost="431.965888" Rows="101.000000" 
Width="12"/>
         </dxl:Properties>
         <dxl:ProjList>
           <dxl:ProjElem ColId="0" Alias="a">
@@ -721,7 +721,7 @@
         <dxl:SortingColumnList/>
         <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
           <dxl:Properties>
-            <dxl:Cost StartupCost="0" TotalCost="431.968426" Rows="101.000000" 
Width="12"/>
+            <dxl:Cost StartupCost="0" TotalCost="431.961371" Rows="101.000000" 
Width="12"/>
           </dxl:Properties>
           <dxl:GroupingColumns>
             <dxl:GroupingColumn ColId="0"/>
@@ -744,7 +744,7 @@
           <dxl:Filter/>
           <dxl:RedistributeMotion InputSegments="0,1,2" OutputSegments="0,1,2">
             <dxl:Properties>
-              <dxl:Cost StartupCost="0" TotalCost="431.945220" 
Rows="568.125000" Width="8"/>
+              <dxl:Cost StartupCost="0" TotalCost="431.938165" 
Rows="568.125000" Width="8"/>
             </dxl:Properties>
             <dxl:ProjList>
               <dxl:ProjElem ColId="0" Alias="a">
@@ -761,9 +761,9 @@
                 <dxl:Ident ColId="0" ColName="a" TypeMdid="0.23.1.0"/>
               </dxl:HashExpr>
             </dxl:HashExprList>
-            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
               <dxl:Properties>
-                <dxl:Cost StartupCost="0" TotalCost="431.940478" 
Rows="568.125000" Width="8"/>
+                <dxl:Cost StartupCost="0" TotalCost="431.933423" 
Rows="568.125000" Width="8"/>
               </dxl:Properties>
               <dxl:GroupingColumns>
                 <dxl:GroupingColumn ColId="0"/>
diff --git 
a/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp
 
b/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp
index c75b21f9c4..7c7633fefd 100644
--- 
a/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp
+++ 
b/src/backend/gporca/data/dxl/minidump/MultiColumnDQA-InnerJoin-GroupBy-HashAggregate.mdp
@@ -503,7 +503,7 @@
               </dxl:ProjElem>
             </dxl:ProjList>
             <dxl:Filter/>
-            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+            <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
               <dxl:Properties>
                 <dxl:Cost StartupCost="0" TotalCost="862.004184" 
Rows="10.000001" Width="20"/>
               </dxl:Properties>
diff --git a/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp 
b/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp
index b2b82f9efc..bcfa49bb55 100644
--- a/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp
+++ b/src/backend/gporca/data/dxl/minidump/ScalarDQAWithNonScalarAgg.mdp
@@ -1496,7 +1496,7 @@
                   <dxl:Ident ColId="2" ColName="c" TypeMdid="0.23.1.0"/>
                 </dxl:HashExpr>
               </dxl:HashExprList>
-              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
                 <dxl:Properties>
                   <dxl:Cost StartupCost="0" TotalCost="441.310000" 
Rows="100000.000000" Width="8"/>
                 </dxl:Properties>
diff --git 
a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp 
b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp
index f04c7e2717..352ecaccf5 100644
--- a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp
+++ b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-DistinctOnDistrCol.mdp
@@ -263,7 +263,7 @@
     <dxl:Plan Id="0" SpaceSize="38">
       <dxl:GatherMotion InputSegments="0,1" OutputSegments="-1">
         <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="572.182073" Rows="4.000000" 
Width="8"/>
+          <dxl:Cost StartupCost="0" TotalCost="571.058696" Rows="4.000000" 
Width="8"/>
         </dxl:Properties>
         <dxl:ProjList>
           <dxl:ProjElem ColId="10" Alias="sum">
@@ -274,7 +274,7 @@
         <dxl:SortingColumnList/>
         <dxl:Result>
           <dxl:Properties>
-            <dxl:Cost StartupCost="0" TotalCost="572.181930" Rows="4.000000" 
Width="8"/>
+            <dxl:Cost StartupCost="0" TotalCost="571.058552" Rows="4.000000" 
Width="8"/>
           </dxl:Properties>
           <dxl:ProjList>
             <dxl:ProjElem ColId="10" Alias="sum">
@@ -285,7 +285,7 @@
           <dxl:OneTimeFilter/>
           <dxl:Aggregate AggregationStrategy="Sorted" StreamSafe="false">
             <dxl:Properties>
-              <dxl:Cost StartupCost="0" TotalCost="572.181930" Rows="4.000000" 
Width="8"/>
+              <dxl:Cost StartupCost="0" TotalCost="571.058552" Rows="4.000000" 
Width="8"/>
             </dxl:Properties>
             <dxl:GroupingColumns>
               <dxl:GroupingColumn ColId="1"/>
@@ -308,7 +308,7 @@
             <dxl:Filter/>
             <dxl:Sort SortDiscardDuplicates="false">
               <dxl:Properties>
-                <dxl:Cost StartupCost="0" TotalCost="572.181903" 
Rows="4.500000" Width="8"/>
+                <dxl:Cost StartupCost="0" TotalCost="571.058525" 
Rows="4.500000" Width="8"/>
               </dxl:Properties>
               <dxl:ProjList>
                 <dxl:ProjElem ColId="0" Alias="a">
@@ -326,7 +326,7 @@
               <dxl:LimitOffset/>
               <dxl:RedistributeMotion InputSegments="0,1" OutputSegments="0,1">
                 <dxl:Properties>
-                  <dxl:Cost StartupCost="0" TotalCost="572.181783" 
Rows="4.500000" Width="8"/>
+                  <dxl:Cost StartupCost="0" TotalCost="571.058406" 
Rows="4.500000" Width="8"/>
                 </dxl:Properties>
                 <dxl:ProjList>
                   <dxl:ProjElem ColId="0" Alias="a">
@@ -343,9 +343,9 @@
                     <dxl:Ident ColId="1" ColName="b" TypeMdid="0.23.1.0"/>
                   </dxl:HashExpr>
                 </dxl:HashExprList>
-                <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+                <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
                   <dxl:Properties>
-                    <dxl:Cost StartupCost="0" TotalCost="572.181727" 
Rows="4.500000" Width="8"/>
+                    <dxl:Cost StartupCost="0" TotalCost="571.058350" 
Rows="4.500000" Width="8"/>
                   </dxl:Properties>
                   <dxl:GroupingColumns>
                     <dxl:GroupingColumn ColId="1"/>
diff --git 
a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp
 
b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp
index c41f9b099f..7980a2280b 100644
--- 
a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp
+++ 
b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbMultipleCol-DistinctOnDistrCol.mdp
@@ -264,7 +264,7 @@
     <dxl:Plan Id="0" SpaceSize="38">
       <dxl:GatherMotion InputSegments="0,1" OutputSegments="-1">
         <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="637.664977" Rows="11.250000" 
Width="8"/>
+          <dxl:Cost StartupCost="0" TotalCost="635.979922" Rows="11.250000" 
Width="8"/>
         </dxl:Properties>
         <dxl:ProjList>
           <dxl:ProjElem ColId="10" Alias="sum">
@@ -275,7 +275,7 @@
         <dxl:SortingColumnList/>
         <dxl:Result>
           <dxl:Properties>
-            <dxl:Cost StartupCost="0" TotalCost="637.664573" Rows="11.250000" 
Width="8"/>
+            <dxl:Cost StartupCost="0" TotalCost="635.979518" Rows="11.250000" 
Width="8"/>
           </dxl:Properties>
           <dxl:ProjList>
             <dxl:ProjElem ColId="10" Alias="sum">
@@ -286,7 +286,7 @@
           <dxl:OneTimeFilter/>
           <dxl:Aggregate AggregationStrategy="Sorted" StreamSafe="false">
             <dxl:Properties>
-              <dxl:Cost StartupCost="0" TotalCost="637.664573" 
Rows="11.250000" Width="8"/>
+              <dxl:Cost StartupCost="0" TotalCost="635.979518" 
Rows="11.250000" Width="8"/>
             </dxl:Properties>
             <dxl:GroupingColumns>
               <dxl:GroupingColumn ColId="1"/>
@@ -313,7 +313,7 @@
             <dxl:Filter/>
             <dxl:Sort SortDiscardDuplicates="false">
               <dxl:Properties>
-                <dxl:Cost StartupCost="0" TotalCost="637.664480" 
Rows="11.250000" Width="12"/>
+                <dxl:Cost StartupCost="0" TotalCost="635.979425" 
Rows="11.250000" Width="12"/>
               </dxl:Properties>
               <dxl:ProjList>
                 <dxl:ProjElem ColId="0" Alias="a">
@@ -335,7 +335,7 @@
               <dxl:LimitOffset/>
               <dxl:RedistributeMotion InputSegments="0,1" OutputSegments="0,1">
                 <dxl:Properties>
-                  <dxl:Cost StartupCost="0" TotalCost="637.663526" 
Rows="11.250000" Width="12"/>
+                  <dxl:Cost StartupCost="0" TotalCost="635.978472" 
Rows="11.250000" Width="12"/>
                 </dxl:Properties>
                 <dxl:ProjList>
                   <dxl:ProjElem ColId="0" Alias="a">
@@ -358,9 +358,9 @@
                     <dxl:Ident ColId="2" ColName="c" TypeMdid="0.23.1.0"/>
                   </dxl:HashExpr>
                 </dxl:HashExprList>
-                <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+                <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
                   <dxl:Properties>
-                    <dxl:Cost StartupCost="0" TotalCost="637.663315" 
Rows="11.250000" Width="12"/>
+                    <dxl:Cost StartupCost="0" TotalCost="635.978260" 
Rows="11.250000" Width="12"/>
                   </dxl:Properties>
                   <dxl:GroupingColumns>
                     <dxl:GroupingColumn ColId="1"/>
diff --git 
a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp
 
b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp
index a41741dae7..d4ff364734 100644
--- 
a/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp
+++ 
b/src/backend/gporca/data/dxl/minidump/ThreeStageAgg-GbandDistinctOnDistrCol.mdp
@@ -263,7 +263,7 @@
     <dxl:Plan Id="0" SpaceSize="11">
       <dxl:GatherMotion InputSegments="0,1" OutputSegments="-1">
         <dxl:Properties>
-          <dxl:Cost StartupCost="0" TotalCost="507.148781" Rows="2.000000" 
Width="8"/>
+          <dxl:Cost StartupCost="0" TotalCost="506.587091" Rows="2.000000" 
Width="8"/>
         </dxl:Properties>
         <dxl:ProjList>
           <dxl:ProjElem ColId="10" Alias="sum">
@@ -274,7 +274,7 @@
         <dxl:SortingColumnList/>
         <dxl:Result>
           <dxl:Properties>
-            <dxl:Cost StartupCost="0" TotalCost="507.148709" Rows="2.000000" 
Width="8"/>
+            <dxl:Cost StartupCost="0" TotalCost="506.587019" Rows="2.000000" 
Width="8"/>
           </dxl:Properties>
           <dxl:ProjList>
             <dxl:ProjElem ColId="10" Alias="sum">
@@ -285,7 +285,7 @@
           <dxl:OneTimeFilter/>
           <dxl:Aggregate AggregationStrategy="Sorted" StreamSafe="false">
             <dxl:Properties>
-              <dxl:Cost StartupCost="0" TotalCost="507.148709" Rows="2.000000" 
Width="8"/>
+              <dxl:Cost StartupCost="0" TotalCost="506.587019" Rows="2.000000" 
Width="8"/>
             </dxl:Properties>
             <dxl:GroupingColumns>
               <dxl:GroupingColumn ColId="0"/>
@@ -308,7 +308,7 @@
             <dxl:Filter/>
             <dxl:Sort SortDiscardDuplicates="false">
               <dxl:Properties>
-                <dxl:Cost StartupCost="0" TotalCost="507.148701" 
Rows="2.000000" Width="4"/>
+                <dxl:Cost StartupCost="0" TotalCost="506.587011" 
Rows="2.000000" Width="4"/>
               </dxl:Properties>
               <dxl:ProjList>
                 <dxl:ProjElem ColId="0" Alias="a">
@@ -321,9 +321,9 @@
               </dxl:SortingColumnList>
               <dxl:LimitCount/>
               <dxl:LimitOffset/>
-              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="true">
+              <dxl:Aggregate AggregationStrategy="Hashed" StreamSafe="false">
                 <dxl:Properties>
-                  <dxl:Cost StartupCost="0" TotalCost="507.148701" 
Rows="2.000000" Width="4"/>
+                  <dxl:Cost StartupCost="0" TotalCost="506.587011" 
Rows="2.000000" Width="4"/>
                 </dxl:Properties>
                 <dxl:GroupingColumns>
                   <dxl:GroupingColumn ColId="0"/>
diff --git a/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp 
b/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp
index dbc325a33c..3567968736 100644
--- a/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp
+++ b/src/backend/gporca/libgpopt/src/xforms/CXformSplitDQA.cpp
@@ -738,13 +738,11 @@ CXformSplitDQA::PexprMultiLevelAggregation(
        CExpressionArray *pdrgpexprLastStage = pdrgpexprPrElSecondStage;
        if (fSplit2LevelsOnly)
        {
-               // for scalar DQA the local aggregate is responsible for 
removing duplicates
-               BOOL fLocalAggGeneratesDuplicates = (0 < 
pdrgpcrLastStage->Size());
-
+               // the local aggregate is responsible for removing duplicates
                pdrgpcrArgDQA->AddRef();
                popFirstStage = GPOS_NEW(mp) CLogicalGbAgg(
                        mp, pdrgpcrLocal, COperator::EgbaggtypeLocal,
-                       fLocalAggGeneratesDuplicates, pdrgpcrArgDQA, aggStage);
+                       false /* fGeneratesDuplicates */, pdrgpcrArgDQA, 
aggStage);
                pdrgpcrLastStage->AddRef();
                popSecondStage = GPOS_NEW(mp) CLogicalGbAgg(
                        mp, pdrgpcrLastStage, COperator::EgbaggtypeGlobal, /* 
egbaggtype */
diff --git a/src/test/isolation2/expected/spilling_hashagg.out 
b/src/test/isolation2/expected/spilling_hashagg.out
new file mode 100644
index 0000000000..8f01e1e241
--- /dev/null
+++ b/src/test/isolation2/expected/spilling_hashagg.out
@@ -0,0 +1,68 @@
+-- start_ignore
+-- end_ignore
+
+-- Test Orca properly removes duplicates in DQA
+-- (https://github.com/greenplum-db/gpdb/issues/14993)
+
+-- GPDB_12_MERGE_FEATURE_NOT_SUPPORTED: After streaming hash aggregates are
+-- supported then add a fault injection for 'force_hashagg_stream_hashtable'.
+-- Until then this test doesn't actually test spilling.
+
+CREATE TABLE test_src_tbl AS WITH cte1 AS ( SELECT field5 from 
generate_series(1,1000) field5 ) SELECT field5 % 100 AS a, field5 % 100  + 1 AS 
b FROM cte1 DISTRIBUTED BY (a);
+CREATE 1000
+ANALYZE test_src_tbl;
+ANALYZE
+
+
+-- Use isolation2 framework to force a streaming hash aggregate to clear the
+-- hash table and stream tuples to next stage aggregate. This is to simulate
+-- hash table spills after 100 tuples inserted any segment.
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'skip', '', '', '', 
100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+ gp_inject_fault 
+-----------------
+ Success:        
+ Success:        
+ Success:        
+ Success:        
+(4 rows)
+CREATE TABLE test_hashagg_on AS SELECT a, COUNT(DISTINCT b) AS b FROM 
test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP 
BY a;
+ QUERY PLAN                                 
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)   
+   ->  HashAggregate                        
+         Group Key: a                       
+         ->  HashAggregate                  
+               Group Key: a, b              
+               ->  Seq Scan on test_src_tbl 
+ Optimizer: Postgres query optimizer        
+(7 rows)
+
+-- Compare results against a group aggregate plan.
+set optimizer_enable_hashagg=off;
+SET
+CREATE TABLE test_hashagg_off AS SELECT a, COUNT(DISTINCT b) AS b FROM 
test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP 
BY a;
+ QUERY PLAN                                 
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)   
+   ->  HashAggregate                        
+         Group Key: a                       
+         ->  HashAggregate                  
+               Group Key: a, b              
+               ->  Seq Scan on test_src_tbl 
+ Optimizer: Postgres query optimizer        
+(7 rows)
+
+-- Results should match
+SELECT (n_total=n_matches) AS match FROM ( SELECT COUNT(*) n_total, SUM(CASE 
WHEN t1.b = t2.b THEN 1 ELSE 0 END) n_matches FROM test_hashagg_on t1 JOIN 
test_hashagg_off t2 ON t1.a = t2.a) t;
+ match 
+-------
+ t     
+(1 row)
+
+
+-- start_ignore
+-- end_ignore
diff --git a/src/test/isolation2/expected/spilling_hashagg_optimizer.out 
b/src/test/isolation2/expected/spilling_hashagg_optimizer.out
new file mode 100644
index 0000000000..8f0f54e200
--- /dev/null
+++ b/src/test/isolation2/expected/spilling_hashagg_optimizer.out
@@ -0,0 +1,68 @@
+-- start_ignore
+-- end_ignore
+
+-- Test Orca properly removes duplicates in DQA
+-- (https://github.com/greenplum-db/gpdb/issues/14993)
+
+-- GPDB_12_MERGE_FEATURE_NOT_SUPPORTED: After streaming hash aggregates are
+-- supported then add a fault injection for 'force_hashagg_stream_hashtable'.
+-- Until then this test doesn't actually test spilling.
+
+CREATE TABLE test_src_tbl AS WITH cte1 AS ( SELECT field5 from 
generate_series(1,1000) field5 ) SELECT field5 % 100 AS a, field5 % 100  + 1 AS 
b FROM cte1 DISTRIBUTED BY (a);
+CREATE 1000
+ANALYZE test_src_tbl;
+ANALYZE
+
+
+-- Use isolation2 framework to force a streaming hash aggregate to clear the
+-- hash table and stream tuples to next stage aggregate. This is to simulate
+-- hash table spills after 100 tuples inserted any segment.
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'skip', '', '', '', 
100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+ gp_inject_fault 
+-----------------
+ Success:        
+ Success:        
+ Success:        
+ Success:        
+(4 rows)
+CREATE TABLE test_hashagg_on AS SELECT a, COUNT(DISTINCT b) AS b FROM 
test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP 
BY a;
+ QUERY PLAN                                 
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)   
+   ->  HashAggregate                        
+         Group Key: a                       
+         ->  HashAggregate                  
+               Group Key: a, b              
+               ->  Seq Scan on test_src_tbl 
+ Optimizer: Pivotal Optimizer (GPORCA)      
+(7 rows)
+
+-- Compare results against a group aggregate plan.
+set optimizer_enable_hashagg=off;
+SET
+CREATE TABLE test_hashagg_off AS SELECT a, COUNT(DISTINCT b) AS b FROM 
test_src_tbl GROUP BY a;
+CREATE 100
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP 
BY a;
+ QUERY PLAN                                 
+--------------------------------------------
+ Gather Motion 3:1  (slice1; segments: 3)   
+   ->  GroupAggregate                       
+         Group Key: a                       
+         ->  Sort                           
+               Sort Key: a                  
+               ->  Seq Scan on test_src_tbl 
+ Optimizer: Pivotal Optimizer (GPORCA)      
+(7 rows)
+
+-- Results should match
+SELECT (n_total=n_matches) AS match FROM ( SELECT COUNT(*) n_total, SUM(CASE 
WHEN t1.b = t2.b THEN 1 ELSE 0 END) n_matches FROM test_hashagg_on t1 JOIN 
test_hashagg_off t2 ON t1.a = t2.a) t;
+ match 
+-------
+ t     
+(1 row)
+
+
+-- start_ignore
+-- end_ignore
diff --git a/src/test/isolation2/isolation2_schedule 
b/src/test/isolation2/isolation2_schedule
index 588fd326e5..5b24284ab6 100644
--- a/src/test/isolation2/isolation2_schedule
+++ b/src/test/isolation2/isolation2_schedule
@@ -311,6 +311,8 @@ test: aocs_unique_index
 test: uao/ao_unique_index_vacuum_row
 test: uao/ao_unique_index_vacuum_column
 
+test: spilling_hashagg
+
 test: local_directory_table_mixed
 test: stat_activity_extended
 test: sharedsnapshot
diff --git a/src/test/isolation2/sql/spilling_hashagg.sql 
b/src/test/isolation2/sql/spilling_hashagg.sql
new file mode 100644
index 0000000000..eaa1564964
--- /dev/null
+++ b/src/test/isolation2/sql/spilling_hashagg.sql
@@ -0,0 +1,49 @@
+-- start_ignore
+CREATE EXTENSION IF NOT EXISTS gp_inject_fault;
+DROP TABLE IF EXISTS test_src_tbl;
+DROP TABLE IF EXISTS test_hashagg_on;
+DROP TABLE IF EXISTS test_hashagg_off;
+-- end_ignore
+
+-- Test Orca properly removes duplicates in DQA
+-- (https://github.com/greenplum-db/gpdb/issues/14993)
+
+-- GPDB_12_MERGE_FEATURE_NOT_SUPPORTED: After streaming hash aggregates are
+-- supported then add a fault injection for 'force_hashagg_stream_hashtable'.
+-- Until then this test doesn't actually test spilling.
+
+CREATE TABLE test_src_tbl AS
+WITH cte1 AS (
+    SELECT field5 from generate_series(1,1000) field5
+)
+SELECT field5 % 100 AS a, field5 % 100  + 1 AS b
+FROM cte1 DISTRIBUTED BY (a);
+ANALYZE test_src_tbl;
+
+
+-- Use isolation2 framework to force a streaming hash aggregate to clear the
+-- hash table and stream tuples to next stage aggregate. This is to simulate
+-- hash table spills after 100 tuples inserted any segment.
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'skip', '', '', '', 
100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+CREATE TABLE test_hashagg_on AS
+SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP 
BY a;
+
+-- Compare results against a group aggregate plan.
+set optimizer_enable_hashagg=off;
+CREATE TABLE test_hashagg_off AS
+SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP BY a;
+EXPLAIN (costs off) SELECT a, COUNT(DISTINCT b) AS b FROM test_src_tbl GROUP 
BY a;
+
+-- Results should match
+SELECT (n_total=n_matches) AS match FROM (
+SELECT COUNT(*) n_total, SUM(CASE WHEN t1.b = t2.b THEN 1 ELSE 0 END) n_matches
+FROM test_hashagg_on t1
+JOIN test_hashagg_off t2 ON t1.a = t2.a) t;
+
+
+-- start_ignore
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'status', '', '', '', 
100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+SELECT gp_inject_fault('force_hashagg_stream_hashtable', 'reset', '', '', '', 
100, 100, 0, dbid) FROM gp_segment_configuration WHERE role='p';
+RESET ALL;
+-- end_ignore
diff --git a/src/test/regress/expected/gp_dqa_optimizer.out 
b/src/test/regress/expected/gp_dqa_optimizer.out
index b73c793c1d..71ad1cde46 100644
--- a/src/test/regress/expected/gp_dqa_optimizer.out
+++ b/src/test/regress/expected/gp_dqa_optimizer.out
@@ -2841,7 +2841,7 @@ explain (verbose on, costs off)select sum(Distinct a), 
count(b), sum(c) from dqa
          ->  Redistribute Motion 3:3  (slice2; segments: 3)
                Output: a, e, (PARTIAL count(b)), (PARTIAL sum(c))
                Hash Key: e
-               ->  Streaming Partial HashAggregate
+               ->  Partial HashAggregate
                      Output: a, e, PARTIAL count(b), PARTIAL sum(c)
                      Group Key: dqa_f3.e, dqa_f3.a
                      ->  Seq Scan on public.dqa_f3
@@ -2870,7 +2870,7 @@ explain (verbose on, costs off) select sum(Distinct e), 
count(b), sum(c) from dq
    ->  Finalize HashAggregate
          Output: sum(e), count(b), sum(c)
          Group Key: dqa_f3.a
-         ->  Streaming Partial HashAggregate
+         ->  Partial HashAggregate
                Output: PARTIAL count(b), PARTIAL sum(c), a, e
                Group Key: dqa_f3.a, dqa_f3.e
                ->  Seq Scan on public.dqa_f3


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to