SharuBob opened a new issue #11603:
URL: https://github.com/apache/druid/issues/11603
Druid SQL: Query with Count(*) is not returning actual number of rows with
HAVING clause on derived columns
### Affected Version
0.21.0
### Description
**We are using Druid for Ads Analytics and reporting. Storing all the
advertisement dimensions and metrics, such as advertiserID, campaignID, clicks,
impressions etc.**
_I'm using Druid SQL for Querying not the native Query:_
The use case is for supporting pagination, we need total count of the rows
returned so can use DRUID SQL LIMIT and OFFSET for pagination.
In order to get the total count for the given query we are using `SELECT
COUNT(*) FROM (SUBQUERY) SOURCE`
**Issue: The total count returned is not the actual number of rows for the
actual query.**
**### Example:**
The below Query would return total 6 rows,
`SELECT advertiser_id, campaign_id, city_code, matched_keywords,
DATE_TRUNC('year', __time) AS Year_,
SUM(impression) AS impression, SUM(click) AS click, (CAST(SUM(click) as
FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_,
COUNT(*) AS counts FROM ad_fact_v1 WHERE __time >= '2018-09-01T18:05:39Z'
AND __time < '2021-08-16T18:05:39Z' AND
advertiser_id IN ('290641618220892166') AND campaign_id IN
('347205814932489837') GROUP BY 1, 2, 3, 4, 5 HAVING click >= 100 AND CTR_ <=
4.90 ORDER BY CTR_ ASC`
Equivalent native query using explain:
`{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "ad_fact_v1"
},
"intervals": {
"type": "intervals",
"intervals": [
"2018-09-01T18:05:39.000Z/2021-08-16T18:05:39.000Z"
]
},
"virtualColumns": [{
"type": "expression",
"name": "v2",
"expression": "timestamp_floor(\"__time\",'P1Y',null,'UTC')",
"outputType": "LONG"
}],
"filter": {
"type": "and",
"fields": [{
"type": "selector",
"dimension": "advertiser_id",
"value": "290641618220892166",
"extractionFn": null
},
{
"type": "selector",
"dimension": "campaign_id",
"value": "347205814932489837",
"extractionFn": null
}
]
},
"granularity": {
"type": "all"
},
"dimensions": [{
"type": "default",
"dimension": "city_code",
"outputName": "d2",
"outputType": "STRING"
},
{
"type": "default",
"dimension": "matched_keywords",
"outputName": "d3",
"outputType": "STRING"
},
{
"type": "default",
"dimension": "v2",
"outputName": "d4",
"outputType": "LONG"
}
],
"aggregations": [{
"type": "longSum",
"name": "a0",
"fieldName": "impression",
"expression": null
},
{
"type": "longSum",
"name": "a1",
"fieldName": "click",
"expression": null
},
{
"type": "count",
"name": "a2"
}
],
"postAggregations": [{
"type": "expression",
"name": "p0",
"expression": "'290641618220892166'",
"ordering": null
},
{
"type": "expression",
"name": "p1",
"expression": "'347205814932489837'",
"ordering": null
},
{
"type": "expression",
"name": "p2",
"expression": "((CAST(\"a1\", 'DOUBLE') / CAST(\"a0\",
'DOUBLE')) * 100)",
"ordering": null
}
],
"having": {
"type": "filter",
"filter": {
"type": "and",
"fields": [{
"type": "bound",
"dimension": "a1",
"lower": "100",
"upper": null,
"lowerStrict": false,
"upperStrict": false,
"extractionFn": null,
"ordering": {
"type": "numeric"
}
},
{
"type": "expression",
"expression": "(((CAST(\"a1\",
'DOUBLE') / CAST(\"a0\", 'DOUBLE')) * 100) <= 4.90)"
}
]
},
"finalize": true
},
"limitSpec": {
"type": "default",
"columns": [{
"dimension": "p2",
"direction": "ascending",
"dimensionOrder": {
"type": "numeric"
}
}],
"limit": 100
},
"context": {
"sqlOuterLimit": 100,
"sqlQueryId": "39b9554f-e4d6-4b80-b106-71bba7f17d05"
},
"descending": false
}`
**So, when I use below query expect the count to be 6. But it seems not
apply HAVING clause here when fetching count.**
`SELECT COUNT(*) AS totalCounts FROM (SELECT advertiser_id, campaign_id,
city_code, matched_keywords, DATE_TRUNC('year', __time) AS Year_,
SUM(impression) AS impression, SUM(click) AS click, (CAST(SUM(click) as
FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_,
COUNT(*) AS counts FROM ad_fact_v1 WHERE __time >= '2018-09-01T18:05:39Z'
AND __time < '2021-08-16T18:05:39Z' AND
advertiser_id IN ('290641618220892166') AND campaign_id IN
('347205814932489837') GROUP BY 1, 2, 3, 4, 5 HAVING click >= 100 AND CTR_ <=
4.90 ORDER BY CTR_ ASC) source`
Equivalent native query using explain:
`DruidOuterQueryRel(query = [{
"queryType": "timeseries",
"dataSource": {
"type": "table",
"name": "__subquery__"
},
"intervals": {
"type": "intervals",
"intervals":
["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
},
"descending": false,
"virtualColumns": [{
"type": "expression",
"name": "v0",
"expression": "((CAST(\"click\", 'DOUBLE') /
CAST(\"impression\", 'DOUBLE')) * 100)",
"outputType": "FLOAT"
}],
"filter": {
"type": "and",
"fields": [{
"type": "bound",
"dimension": "click",
"lower": "100",
"upper": null,
"lowerStrict": false,
"upperStrict": false,
"extractionFn": null,
"ordering": {
"type": "numeric"
}
}, {
"type": "bound",
"dimension": "v0",
"lower": null,
"upper": "4.90",
"lowerStrict": false,
"upperStrict": false,
"extractionFn": null,
"ordering": {
"type": "numeric"
}
}]
},
"granularity": {
"type": "all"
},
"aggregations": [{
"type": "count",
"name": "a0"
}],
"postAggregations": [],
"limit": 2147483647,
"context": {
"skipEmptyBuckets": true,
"sqlOuterLimit": 100,
"sqlQueryId": "c35534f5-38dc-4747-8812-36d2c063dd74"
}
}], signature = [{
a0: LONG
}]) DruidQueryRel(query = [{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "ad_fact_v1"
},
"intervals": {
"type": "intervals",
"intervals":
["2018-09-01T18:05:39.000Z/2021-08-16T18:05:39.000Z"]
},
"virtualColumns": [{
"type": "expression",
"name": "v0",
"expression": "'290641618220892166'",
"outputType": "STRING"
}, {
"type": "expression",
"name": "v1",
"expression": "'347205814932489837'",
"outputType": "STRING"
}, {
"type": "expression",
"name": "v2",
"expression": "timestamp_floor(\"__time\",'P1Y',null,'UTC')",
"outputType": "LONG"
}],
"filter": {
"type": "and",
"fields": [{
"type": "selector",
"dimension": "advertiser_id",
"value": "290641618220892166",
"extractionFn": null
}, {
"type": "selector",
"dimension": "campaign_id",
"value": "347205814932489837",
"extractionFn": null
}]
},
"granularity": {
"type": "all"
},
"dimensions": [{
"type": "default",
"dimension": "v0",
"outputName": "d0",
"outputType": "STRING"
}, {
"type": "default",
"dimension": "v1",
"outputName": "d1",
"outputType": "STRING"
}, {
"type": "default",
"dimension": "city_code",
"outputName": "d2",
"outputType": "STRING"
}, {
"type": "default",
"dimension": "matched_keywords",
"outputName": "d3",
"outputType": "STRING"
}, {
"type": "default",
"dimension": "v2",
"outputName": "d4",
"outputType": "LONG"
}],
"aggregations": [{
"type": "longSum",
"name": "a0",
"fieldName": "impression",
"expression": null
}, {
"type": "longSum",
"name": "a1",
"fieldName": "click",
"expression": null
}],
"postAggregations": [],
"having": null,
"limitSpec": {
"type": "NoopLimitSpec"
},
"context": {
"sqlOuterLimit": 100,
"sqlQueryId": "c35534f5-38dc-4747-8812-36d2c063dd74"
},
"descending": false
}], signature = [{
d0: STRING,
d1: STRING,
d2: STRING,
d3: STRING,
d4: LONG,
a0: LONG,
a1: LONG
}])`
Debug Done:
- Looks like derived aggregation like **(CAST(SUM(click) as
FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_** is not considered while
doing the count(*).
- count(*) is using _"queryType": "timeseries"_ where as without count(*)
uses _"queryType": "groupBy"_. Is that possible reason?
Any help would be highly appreciated, thank you in advance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]