[GitHub] [druid] SharuBob opened a new issue #11603: Druid SQL: Query with Count(*) is not returning actual number of rows with HAVING clause on derived columns

GitBox Mon, 16 Aug 2021 12:07:40 -0700


SharuBob opened a new issue #11603:
URL: https://github.com/apache/druid/issues/11603



   Druid SQL: Query with Count(*) is not returning actual number of rows with 
HAVING clause on derived columns
   
   ### Affected Version
   0.21.0
   
   ### Description
   
   **We are using Druid for Ads Analytics and reporting. Storing all the 
advertisement dimensions and metrics, such as advertiserID, campaignID, clicks, 
impressions etc.**
   
   _I'm using Druid SQL for Querying not the native Query:_
   The use case is for supporting pagination, we need total count of the rows 
returned so can use DRUID SQL LIMIT and OFFSET for pagination.
   
   In order to get the total count for the given query we are using `SELECT 
COUNT(*) FROM (SUBQUERY) SOURCE`
   
   **Issue: The total count returned is not the actual number of rows for the 
actual query.**
   
   **### Example:**
   
   The below Query would return total 6 rows, 
   
   `SELECT advertiser_id, campaign_id, city_code, matched_keywords, 
DATE_TRUNC('year', __time) AS Year_, 
   SUM(impression) AS impression, SUM(click) AS click, (CAST(SUM(click) as 
FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_, 
   COUNT(*) AS counts FROM ad_fact_v1 WHERE __time >= '2018-09-01T18:05:39Z' 
AND __time < '2021-08-16T18:05:39Z' AND 
   advertiser_id IN ('290641618220892166') AND campaign_id IN 
('347205814932489837') GROUP BY 1, 2, 3, 4, 5 HAVING click >= 100 AND CTR_ <= 
4.90 ORDER BY CTR_ ASC`
   
   Equivalent native query using explain:
   
   `{
        "queryType": "groupBy",
        "dataSource": {
                "type": "table",
                "name": "ad_fact_v1"
        },
        "intervals": {
                "type": "intervals",
                "intervals": [
                        "2018-09-01T18:05:39.000Z/2021-08-16T18:05:39.000Z"
                ]
        },
        "virtualColumns": [{
                "type": "expression",
                "name": "v2",
                "expression": "timestamp_floor(\"__time\",'P1Y',null,'UTC')",
                "outputType": "LONG"
        }],
        "filter": {
                "type": "and",
                "fields": [{
                                "type": "selector",
                                "dimension": "advertiser_id",
                                "value": "290641618220892166",
                                "extractionFn": null
                        },
                        {
                                "type": "selector",
                                "dimension": "campaign_id",
                                "value": "347205814932489837",
                                "extractionFn": null
                        }
                ]
        },
        "granularity": {
                "type": "all"
        },
        "dimensions": [{
                        "type": "default",
                        "dimension": "city_code",
                        "outputName": "d2",
                        "outputType": "STRING"
                },
                {
                        "type": "default",
                        "dimension": "matched_keywords",
                        "outputName": "d3",
                        "outputType": "STRING"
                },
                {
                        "type": "default",
                        "dimension": "v2",
                        "outputName": "d4",
                        "outputType": "LONG"
                }
        ],
        "aggregations": [{
                        "type": "longSum",
                        "name": "a0",
                        "fieldName": "impression",
                        "expression": null
                },
                {
                        "type": "longSum",
                        "name": "a1",
                        "fieldName": "click",
                        "expression": null
                },
                {
                        "type": "count",
                        "name": "a2"
                }
        ],
        "postAggregations": [{
                        "type": "expression",
                        "name": "p0",
                        "expression": "'290641618220892166'",
                        "ordering": null
                },
                {
                        "type": "expression",
                        "name": "p1",
                        "expression": "'347205814932489837'",
                        "ordering": null
                },
                {
                        "type": "expression",
                        "name": "p2",
                        "expression": "((CAST(\"a1\", 'DOUBLE') / CAST(\"a0\", 
'DOUBLE')) * 100)",
                        "ordering": null
                }
        ],
        "having": {
                "type": "filter",
                "filter": {
                        "type": "and",
                        "fields": [{
                                        "type": "bound",
                                        "dimension": "a1",
                                        "lower": "100",
                                        "upper": null,
                                        "lowerStrict": false,
                                        "upperStrict": false,
                                        "extractionFn": null,
                                        "ordering": {
                                                "type": "numeric"
                                        }
                                },
                                {
                                        "type": "expression",
                                        "expression": "(((CAST(\"a1\", 
'DOUBLE') / CAST(\"a0\", 'DOUBLE')) * 100) <= 4.90)"
                                }
                        ]
                },
                "finalize": true
        },
        "limitSpec": {
                "type": "default",
                "columns": [{
                        "dimension": "p2",
                        "direction": "ascending",
                        "dimensionOrder": {
                                "type": "numeric"
                        }
                }],
                "limit": 100
        },
        "context": {
                "sqlOuterLimit": 100,
                "sqlQueryId": "39b9554f-e4d6-4b80-b106-71bba7f17d05"
        },
        "descending": false
   }`
   
   **So, when I use below query expect the count to be 6. But it seems not 
apply HAVING clause here when fetching count.**
   
   `SELECT COUNT(*) AS totalCounts FROM (SELECT advertiser_id, campaign_id, 
city_code, matched_keywords, DATE_TRUNC('year', __time) AS Year_, 
   SUM(impression) AS impression, SUM(click) AS click, (CAST(SUM(click) as 
FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_, 
   COUNT(*) AS counts FROM ad_fact_v1 WHERE __time >= '2018-09-01T18:05:39Z' 
AND __time < '2021-08-16T18:05:39Z' AND 
   advertiser_id IN ('290641618220892166') AND campaign_id IN 
('347205814932489837') GROUP BY 1, 2, 3, 4, 5 HAVING click >= 100 AND CTR_ <= 
4.90 ORDER BY CTR_ ASC) source`
   
   Equivalent native query using explain:
   
   `DruidOuterQueryRel(query = [{
        "queryType": "timeseries",
        "dataSource": {
                "type": "table",
                "name": "__subquery__"
        },
        "intervals": {
                "type": "intervals",
                "intervals": 
["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
        },
        "descending": false,
        "virtualColumns": [{
                "type": "expression",
                "name": "v0",
                "expression": "((CAST(\"click\", 'DOUBLE') / 
CAST(\"impression\", 'DOUBLE')) * 100)",
                "outputType": "FLOAT"
        }],
        "filter": {
                "type": "and",
                "fields": [{
                        "type": "bound",
                        "dimension": "click",
                        "lower": "100",
                        "upper": null,
                        "lowerStrict": false,
                        "upperStrict": false,
                        "extractionFn": null,
                        "ordering": {
                                "type": "numeric"
                        }
                }, {
                        "type": "bound",
                        "dimension": "v0",
                        "lower": null,
                        "upper": "4.90",
                        "lowerStrict": false,
                        "upperStrict": false,
                        "extractionFn": null,
                        "ordering": {
                                "type": "numeric"
                        }
                }]
        },
        "granularity": {
                "type": "all"
        },
        "aggregations": [{
                "type": "count",
                "name": "a0"
        }],
        "postAggregations": [],
        "limit": 2147483647,
        "context": {
                "skipEmptyBuckets": true,
                "sqlOuterLimit": 100,
                "sqlQueryId": "c35534f5-38dc-4747-8812-36d2c063dd74"
        }
   }], signature = [{
        a0: LONG
   }]) DruidQueryRel(query = [{
        "queryType": "groupBy",
        "dataSource": {
                "type": "table",
                "name": "ad_fact_v1"
        },
        "intervals": {
                "type": "intervals",
                "intervals": 
["2018-09-01T18:05:39.000Z/2021-08-16T18:05:39.000Z"]
        },
        "virtualColumns": [{
                "type": "expression",
                "name": "v0",
                "expression": "'290641618220892166'",
                "outputType": "STRING"
        }, {
                "type": "expression",
                "name": "v1",
                "expression": "'347205814932489837'",
                "outputType": "STRING"
        }, {
                "type": "expression",
                "name": "v2",
                "expression": "timestamp_floor(\"__time\",'P1Y',null,'UTC')",
                "outputType": "LONG"
        }],
        "filter": {
                "type": "and",
                "fields": [{
                        "type": "selector",
                        "dimension": "advertiser_id",
                        "value": "290641618220892166",
                        "extractionFn": null
                }, {
                        "type": "selector",
                        "dimension": "campaign_id",
                        "value": "347205814932489837",
                        "extractionFn": null
                }]
        },
        "granularity": {
                "type": "all"
        },
        "dimensions": [{
                "type": "default",
                "dimension": "v0",
                "outputName": "d0",
                "outputType": "STRING"
        }, {
                "type": "default",
                "dimension": "v1",
                "outputName": "d1",
                "outputType": "STRING"
        }, {
                "type": "default",
                "dimension": "city_code",
                "outputName": "d2",
                "outputType": "STRING"
        }, {
                "type": "default",
                "dimension": "matched_keywords",
                "outputName": "d3",
                "outputType": "STRING"
        }, {
                "type": "default",
                "dimension": "v2",
                "outputName": "d4",
                "outputType": "LONG"
        }],
        "aggregations": [{
                "type": "longSum",
                "name": "a0",
                "fieldName": "impression",
                "expression": null
        }, {
                "type": "longSum",
                "name": "a1",
                "fieldName": "click",
                "expression": null
        }],
        "postAggregations": [],
        "having": null,
        "limitSpec": {
                "type": "NoopLimitSpec"
        },
        "context": {
                "sqlOuterLimit": 100,
                "sqlQueryId": "c35534f5-38dc-4747-8812-36d2c063dd74"
        },
        "descending": false
   }], signature = [{
        d0: STRING,
        d1: STRING,
        d2: STRING,
        d3: STRING,
        d4: LONG,
        a0: LONG,
        a1: LONG
   }])`
   
   
   
   
   
   Debug Done:
   
   - Looks like derived aggregation like **(CAST(SUM(click) as 
FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_** is not considered while 
doing the count(*).
   - count(*) is using _"queryType": "timeseries"_ where as without count(*) 
uses _"queryType": "groupBy"_. Is that possible reason?
   
   Any help would be highly appreciated, thank you in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] SharuBob opened a new issue #11603: Druid SQL: Query with Count(*) is not returning actual number of rows with HAVING clause on derived columns

Reply via email to