[
https://issues.apache.org/jira/browse/HIVE-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jesus Camacho Rodriguez resolved HIVE-15636.
--------------------------------------------
Resolution: Fixed
Assignee: Julian Hyde (was: Jesus Camacho Rodriguez)
Fix Version/s: 3.0.0
Fixed with upgrade to Calcite 1.12 (HIVE-15708).
> Hive/Druid integration: wrong semantics of topN query limit with granularity
> ----------------------------------------------------------------------------
>
> Key: HIVE-15636
> URL: https://issues.apache.org/jira/browse/HIVE-15636
> Project: Hive
> Issue Type: Bug
> Components: Druid integration
> Affects Versions: 2.2.0
> Reporter: Jesus Camacho Rodriguez
> Assignee: Julian Hyde
> Priority: Critical
> Fix For: 3.0.0
>
>
> Semantics of Druid topN query with limit and granularity is not equivalent to
> input SQL. In particular, limit is applied on each granularity value, not on
> the overall query.
> Currently, the following query will be transformed into a topN query:
> {code:sql}
> SELECT i_brand_id, floor_day(`__time`), max(ss_quantity),
> sum(ss_wholesale_cost) as s
> FROM store_sales_sold_time_subset
> GROUP BY i_brand_id, floor_day(`__time`)
> ORDER BY s DESC
> LIMIT 10;
> OK
> Plan optimized by CBO.
> Stage-0
> Fetch Operator
> limit:-1
> Stage-1
> Map 1 vectorized
> File Output Operator [FS_4]
> Select Operator [SEL_3] (rows=15888 width=0)
> Output:["_col0","_col1","_col2","_col3"]
> TableScan [TS_0] (rows=15888 width=0)
>
> tpcds_druid_10@store_sales_sold_time_subset,store_sales_sold_time_subset,Tbl:PARTIAL,Col:NONE,Output:["i_brand_id","__time","$f2","$f3"],properties:{"druid.query.json":"{\"queryType\":\"topN\",\"dataSource\":\"druid_tpcds_ss_sold_time_subset\",\"granularity\":\"DAY\",\"dimension\":\"i_brand_id\",\"metric\":\"$f3\",\"aggregations\":[{\"type\":\"longMax\",\"name\":\"$f2\",\"fieldName\":\"ss_quantity\"},{\"type\":\"doubleSum\",\"name\":\"$f3\",\"fieldName\":\"ss_wholesale_cost\"}],\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"threshold\":10}","druid.query.type":"topN"}
> {code}
> It outputs 300 rows, 10 per day. In turn, the equivalent SQL query for a
> Druid topN query should be expressed as:
> {code:sql}
> SELECT rs.i_brand_id, rs.d, rs.m, rs.s
> FROM (
> SELECT i_brand_id, floor_day(`__time`) as d, max(ss_quantity) as m,
> sum(ss_wholesale_cost) as s,
> ROW_NUMBER() OVER (PARTITION BY floor_day(`__time`) ORDER BY
> sum(ss_wholesale_cost) DESC ) AS rownum
> FROM store_sales_sold_time_subset
> GROUP BY i_brand_id, floor_day(`__time`)
> ) rs
> WHERE rownum <= 10;
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)