[
https://issues.apache.org/jira/browse/PHOENIX-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384766#comment-14384766
]
jayapriya surendran commented on PHOENIX-1772:
----------------------------------------------
The most common use case for CUBE and ROLLUP would be to perform online and
offline aggregations. The multi-dimensional pre-aggregated values can be stored
in a database/HBase so that business analysts can perform lookup queries. I
have illustrated both use cases with the examples below.
*Offline aggregation example:*
1) A company's sales information is collected in HDFS/DBMS every day.
2) At the end of the day, a cube job is scheduled which computes aggregations
for the given dimensions and measures. Let's say the dimensions are Country,
State, and City, and the measure is SUM(profit).
3) Computing a cube is time- and space-intensive: as the number of dimensions
grows, the time and space needed to compute and store the cube grow as well. So
the cube is computed offline using Hive/Pig, and the aggregated results are
stored in a DBMS/HBase for serving.
4) Analysts can query the pre-aggregated results using lookup queries like
"what is the profit yesterday for USA, CA, San Jose?", "what is the overall
profit yesterday for USA, CA?" or "what is the global overall profit
yesterday?". All such queries can be answered without any expensive
computations.
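As an illustrative sketch of steps 2-4 (the table contents, dimension names, and profit figures are made up for the example), computing the full cube over (Country, State, City) and then answering the lookup queries could look like this in Python:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical daily sales rows: (country, state, city, profit).
sales = [
    ("USA", "CA", "San Jose", 100),
    ("USA", "CA", "San Jose", 50),
    ("USA", "CA", "Palo Alto", 30),
    ("USA", "NY", "New York", 70),
    ("IND", "KA", "Bangalore", 40),
]

def compute_cube(rows):
    """Pre-aggregate SUM(profit) for every subset of the 3 dimensions.

    '*' marks a rolled-up dimension, so the result answers a lookup at
    any level: (country, state, city), (country, state, *), ..., (*, *, *).
    """
    cube = defaultdict(int)
    dims = range(3)  # indices of country, state, city
    for *key, profit in rows:
        # Each input row contributes to 2^3 = 8 aggregation keys.
        for r in range(len(dims) + 1):
            for kept in combinations(dims, r):
                masked = tuple(key[i] if i in kept else "*" for i in dims)
                cube[masked] += profit
    return dict(cube)

cube = compute_cube(sales)
# The lookup queries become O(1) reads of the pre-aggregated table:
print(cube[("USA", "CA", "San Jose")])  # profit for USA, CA, San Jose -> 150
print(cube[("USA", "CA", "*")])         # overall profit for USA, CA   -> 180
print(cube[("*", "*", "*")])            # global overall profit        -> 290
```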
*Online aggregation example:*
1) Let's say we store all the tweets in HBase.
2) Compute cube/rollup and store the aggregated results back to HBase. The
stored cube (aggregated result) can be refreshed or discarded regularly.
3) This will help answer queries like "How many tweets were posted
in the last hour in USA, CA, San Jose?", "How many tweets were posted in USA in
the past 3 hours about elections?" etc.
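As a sketch of this idea (the class, the hourly bucket granularity, and the retention policy are assumptions for illustration, not a Phoenix/HBase API): a ROLLUP-style counter keyed by hour bucket, whose stale buckets can be discarded to refresh the stored aggregate:

```python
from collections import defaultdict

class HourlyRollup:
    """Tweet counts pre-aggregated per (hour, country, state, city) hierarchy."""

    def __init__(self, retention_hours=24):
        self.retention_hours = retention_hours
        self.counts = defaultdict(int)  # (hour, country, state, city) -> count

    def add_tweet(self, hour, country, state, city):
        # ROLLUP-style hierarchy: (c, s, city), (c, s, *), (c, *, *), (*, *, *)
        levels = [(country, state, city), (country, state, "*"),
                  (country, "*", "*"), ("*", "*", "*")]
        for lvl in levels:
            self.counts[(hour,) + lvl] += 1

    def query(self, hours, country="*", state="*", city="*"):
        # e.g. "tweets in the past 3 hours in USA" -> sum over 3 hour buckets
        return sum(self.counts.get((h, country, state, city), 0) for h in hours)

    def expire(self, current_hour):
        # Discard buckets outside the retention window (the "refresh or
        # discard regularly" step).
        cutoff = current_hour - self.retention_hours
        for key in [k for k in self.counts if k[0] <= cutoff]:
            del self.counts[key]

rollup = HourlyRollup()
rollup.add_tweet(10, "USA", "CA", "San Jose")
rollup.add_tweet(10, "USA", "CA", "San Jose")
rollup.add_tweet(11, "USA", "NY", "New York")
print(rollup.query([10, 11], "USA", "CA", "San Jose"))  # -> 2
print(rollup.query([10, 11], "USA"))                    # -> 3
```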
I have seen similar use cases here
http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/
and here
https://engineering.linkedin.com/big-data/open-sourcing-cubert-high-performance-computation-engine-complex-big-data-analytics
> Add CUBE/ROLLUP/GROUPING SET operators for advanced aggregations
> ----------------------------------------------------------------
>
> Key: PHOENIX-1772
> URL: https://issues.apache.org/jira/browse/PHOENIX-1772
> Project: Phoenix
> Issue Type: New Feature
> Reporter: jayapriya surendran
> Assignee: jayapriya surendran
> Labels: gsoc2015, java, sql
> Attachments: GSoCProposal.pdf
>
>
> I noticed from the Phoenix language documentation (
> http://phoenix.apache.org/language/index.html ) that Phoenix is missing the
> CUBE/ROLLUP and GROUPING SETS operators, which are already supported by
> similar projects like Apache Pig and Apache Hive. Here is a brief overview
> of my proposal (the syntax proposed below is the same as PostgreSQL's:
> https://wiki.postgresql.org/wiki/Grouping_Sets)
> *Proposed syntax for CUBE:*
> SELECT name, place, SUM(count) FROM cars GROUP BY CUBE(name, place);
> For every row that we process we need to emit 2^n combinations of rows,
> where n is the number of grouping columns. For the above example query, for
> every row we need to emit 4 rows, one for each level of aggregation
> {(name, place), (name, *), (*, place), (*, *)}.
> *Proposed syntax for ROLLUP:*
> SELECT name, place, SUM(count) FROM cars GROUP BY ROLLUP(name, place);
> For every row that we process we need to emit n+1 combinations of rows,
> where n is the number of grouping columns. For the above example query, for
> every row we need to emit 3 rows, one for each hierarchical level of
> aggregation {(name, place), (name, *), (*, *)}.
> *Proposed syntax for GROUPING SETS:*
> SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS(name, ());
> For every row that we process we need to emit n combinations of rows, where
> n is the number of grouping sets specified. For the above example query, for
> every row we need to emit 2 rows, one for each specified level of
> aggregation {(name, *), (*, *)}
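The three expansion rules quoted above can be sketched in Python. This is only an illustration of the row-emission logic, not Phoenix code; '*' stands for a rolled-up column, matching the {(name, place), ...} notation used above:

```python
from itertools import combinations

def cube_sets(cols):
    """All 2^n subsets of the grouping columns, as in GROUP BY CUBE(...)."""
    n = len(cols)
    return [tuple(c if i in kept else "*" for i, c in enumerate(cols))
            for r in range(n, -1, -1) for kept in combinations(range(n), r)]

def rollup_sets(cols):
    """The n+1 hierarchical prefixes, as in GROUP BY ROLLUP(...)."""
    n = len(cols)
    return [tuple(cols[i] if i < k else "*" for i in range(n))
            for k in range(n, -1, -1)]

def grouping_sets(cols, sets):
    """The explicitly listed sets, as in GROUP BY GROUPING SETS((name), ())."""
    return [tuple(c if c in s else "*" for c in cols) for s in sets]

print(cube_sets(("name", "place")))
# CUBE(name, place):     (name, place), (name, *), (*, place), (*, *)
print(rollup_sets(("name", "place")))
# ROLLUP(name, place):   (name, place), (name, *), (*, *)
print(grouping_sets(("name", "place"), [("name",), ()]))
# GROUPING SETS(name, ()): (name, *), (*, *)
```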