[jira] [Resolved] (HIVE-15636) Hive/Druid integration: wrong semantics of topN query limit with granularity

Jesus Camacho Rodriguez (JIRA) Tue, 18 Apr 2017 02:12:01 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jesus Camacho Rodriguez resolved HIVE-15636.
--------------------------------------------
       Resolution: Fixed
         Assignee: Julian Hyde  (was: Jesus Camacho Rodriguez)
    Fix Version/s: 3.0.0

Fixed with upgrade to Calcite 1.12 (HIVE-15708).

> Hive/Druid integration: wrong semantics of topN query limit with granularity
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-15636
>                 URL: https://issues.apache.org/jira/browse/HIVE-15636
>             Project: Hive
>          Issue Type: Bug
>          Components: Druid integration
>    Affects Versions: 2.2.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Julian Hyde
>            Priority: Critical
>             Fix For: 3.0.0
>
>
> Semantics of Druid topN query with limit and granularity is not equivalent to 
> input SQL. In particular, limit is applied on each granularity value, not on 
> the overall query.
> Currently, the following query will be transformed into a topN query:
> {code:sql}
> SELECT i_brand_id, floor_day(`__time`), max(ss_quantity), 
> sum(ss_wholesale_cost) as s
> FROM store_sales_sold_time_subset
> GROUP BY i_brand_id, floor_day(`__time`)
> ORDER BY s DESC
> LIMIT 10;
> OK
> Plan optimized by CBO.
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Map 1 vectorized
>       File Output Operator [FS_4]
>         Select Operator [SEL_3] (rows=15888 width=0)
>           Output:["_col0","_col1","_col2","_col3"]
>           TableScan [TS_0] (rows=15888 width=0)
>             
> tpcds_druid_10@store_sales_sold_time_subset,store_sales_sold_time_subset,Tbl:PARTIAL,Col:NONE,Output:["i_brand_id","__time","$f2","$f3"],properties:{"druid.query.json":"{\"queryType\":\"topN\",\"dataSource\":\"druid_tpcds_ss_sold_time_subset\",\"granularity\":\"DAY\",\"dimension\":\"i_brand_id\",\"metric\":\"$f3\",\"aggregations\":[{\"type\":\"longMax\",\"name\":\"$f2\",\"fieldName\":\"ss_quantity\"},{\"type\":\"doubleSum\",\"name\":\"$f3\",\"fieldName\":\"ss_wholesale_cost\"}],\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"threshold\":10}","druid.query.type":"topN"}
> {code}
> It outputs 300 rows, 10 per day. In turn, the equivalent SQL query for a 
> Druid topN query should be expressed as:
> {code:sql}
> SELECT rs.i_brand_id, rs.d, rs.m, rs.s
> FROM (
>     SELECT i_brand_id, floor_day(`__time`) as d, max(ss_quantity) as m, 
> sum(ss_wholesale_cost) as s,
>            ROW_NUMBER() OVER (PARTITION BY floor_day(`__time`) ORDER BY 
> sum(ss_wholesale_cost) DESC ) AS rownum
>     FROM store_sales_sold_time_subset
>     GROUP BY i_brand_id, floor_day(`__time`)
> ) rs
> WHERE rownum <= 10;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Resolved] (HIVE-15636) Hive/Druid integration: wrong semantics of topN query limit with granularity

Reply via email to