a2l007 opened a new issue, #13204:
URL: https://github.com/apache/druid/issues/13204
### Affected Version
Tested on 0.22, 0.23 and 24.0
### Description
Consider the following SQL query:
```
select sum(added) AS added_sum, countryName from wikipedia
where cityName = 'Cilacap'
and countryName IN ('Indonesia')
GROUP BY GROUPING SETS ( (FLOOR(__time TO MONTH), countryName),
(FLOOR(__time TO MONTH)) )
```
The grouping set dimension `countryName` is also included in the WHERE
clause filtering on a single value. The result for this query ends up being:
```
added_sum | countryName
8 | Indonesia
8 | Indonesia
```
instead of:
```
added_sum | countryName
8 | Indonesia
8 | null
```
Since the second subtotal group doesn't include `countryName` , the second
result row should have this value as null. The aggregate results are correct
either ways, but formatting the query results can be confusing since
`COALESCE(countryName, 'TOTAL')` as countryName does not work here.
### Analysis
The native query plan for this query is:
```
[
{
"query": {
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "wikipedia"
},
"intervals": {
"type": "intervals",
"intervals": [
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
]
},
"virtualColumns": [
{
"type": "expression",
"name": "v0",
"expression":
"timestamp_floor(\"__time\",'P1M',null,'UTC')",
"outputType": "LONG"
}
],
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "cityName",
"value": "Cilacap"
},
{
"type": "selector",
"dimension": "countryName",
"value": "Indonesia"
}
]
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "v0",
"outputName": "d0",
"outputType": "LONG"
}
],
"aggregations": [
{
"type": "longSum",
"name": "a0",
"fieldName": "added"
}
],
"postAggregations": [
{
"type": "expression",
"name": "p0",
"expression": "'Indonesia'"
}
],
"limitSpec": {
"type": "default",
"columns": [],
"limit": 1001
},
"subtotalsSpec": [
[
"d0"
],
[
"d0"
]
],
"context": {
"sqlOuterLimit": 1001,
"sqlQueryId": "52fce1a4-7d39-4c02-aad1-29cec2c2e3c3",
"timestampResultField": "d0",
"timestampResultFieldGranularity": "MONTH",
"timestampResultFieldInOriginalDimensions": 0
}
},
"signature": [
{
"name": "a0",
"type": "LONG"
},
{
"name": "p0",
"type": "STRING"
}
]
}
]
```
It can be seen that during the planning phase, the `countryName` dimension
was converted into a post aggregator and removed from the dimensions list. As a
result, the subtotal grouping ends up grouping on `__time` in both cases.
This issue does not arise if there are more than one values in the filter
list for `countryName` since this dimension can no longer be converted to a
post agg and thus remains in the dimension list.
One possible fix that I'm currently evaluating is to skip dropping the
dimension
[here](https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/Grouping.java#L240)
if there is a subtotal spec existing for that dimension and modify the row
signature accordingly. Unless there are any better ways to fix this issue, I
can send out the PR after adequate testing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]