vbarua opened a new issue, #17910:
URL: https://github.com/apache/datafusion/issues/17910

   ### Describe the bug
   
   
   In the presences of multiple grouping sets, the [output 
schema](https://substrait.io/relations/logical_relations/#aggregate-operation) 
for a AggregateRel consists of:
   1. All expressions in the groupingExpressions field
   2. All measures
   3. An extra i32 column to indicate which grouping a row belongs to.
   
   In Datafusion, the default order of the schema seems to be
   1. All expressions in the groupingExpressions field
   2. An extra i32 column to indicate which grouping a row belongs to.
   3. All measures
   
   This different in schemas does not appear to be handled in the consumer.
   
   ### To Reproduce
   
   The following Substrait plan
   ```json
   {
     "extensionUris": [
       {
         "extensionUriAnchor": 1,
         "uri": 
"https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml";
       }
     ],
     "extensions": [
       {
         "extensionFunction": {
           "extensionUriReference": 1,
           "functionAnchor": 1,
           "name": "sum:i8"
         }
       }
     ],
     "relations": [
       {
         "root": {
           "input": {
             "aggregate": {
               "common": {
                 "emit": {
                   "outputMapping": [
                     0, 1, 2
                   ]
                 }
               },
               "input": {
                 "read": {
                   "baseSchema": {
                     "names": [
                       "c0",
                       "c1"
                     ],
                     "struct": {
                       "nullability": "NULLABILITY_REQUIRED",
                       "types": [
                         {
                           "i8": {
                             "nullability": "NULLABILITY_NULLABLE"
                           }
                         },
                         {
                           "i8": {
                             "nullability": "NULLABILITY_NULLABLE"
                           }
                         }
                       ]
                     }
                   },
                   "common": {
                     "direct": {}
                   },
                   "virtualTable": {}
                 }
               },
               "groupingExpressions": [
                 {
                   "selection": {
                     "directReference": {
                       "structField": {}
                     },
                     "rootReference": {}
                   }
                 },
                 {
                   "selection": {
                     "directReference": {
                       "structField": {
                         "field": 1
                       }
                     },
                     "rootReference": {}
                   }
                 }
               ],
               "groupings": [
                 {
                   "expressionReferences": [0]
                 },
                 {
                   "expressionReferences": [1]
                 },
                 {
                   "expressionReferences": [0, 1]
                 }
               ],
               "measures": [
                 {
                   "measure": {
                     "arguments": [
                       {
                         "value": {
                           "selection": {
                             "directReference": {
                               "structField": {}
                             },
                             "rootReference": {}
                           }
                         }
                       }
                     ],
                     "functionReference": 1,
                     "invocation": "AGGREGATION_INVOCATION_ALL",
                     "outputType": {
                       "i8": {
                         "nullability": "NULLABILITY_NULLABLE"
                       }
                     },
                     "phase": "AGGREGATION_PHASE_INITIAL_TO_RESULT"
                   }
                 }
               ]
             }
           },
           "names": [
             "c0",
             "c1",
             "summation"
           ]
         }
       }
     ],
     "version": {
       "minorNumber": 29
     }
   }
   ```
   
   which captures a query like
   ```sql
   SELECT c0, c1, sum(0) AS summation
   FROM <some table>
   GROUP BY GROUPING SETS((c0), (c1), (c0, c1))
   ```
   yields the following DataFusion plan
   ```
   Projection: c0, c1, __grouping_id AS summation
     Aggregate: groupBy=[[GROUPING SETS ((c0), (c1), (c0, c1))]], 
aggr=[[sum(c0)]]
       EmptyRelation: rows=0
   ```
   
   With this plan, output column 2 should correspond to the summation, but in 
the Datafusion plan it is the `__grouping_id` column.
   
   ### Expected behavior
   
   We would expect something like instead
   ```
   Projection: c0, c1, sum(c0) AS summation
     Aggregate: groupBy=[[GROUPING SETS ((c0), (c1), (c0, c1))]], 
aggr=[[sum(c0)]]
       EmptyRelation: rows=0
   ```
   
   ### Additional context
   
   I suspect that fixing this in the consumer will also reveal bugs in the 
producer around handling for multiple groupings sets.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to