[GitHub] [incubator-doris] Youngwb opened a new issue #4674: Support approximate TopN function for Doris

GitBox Sat, 26 Sep 2020 04:04:33 -0700


Youngwb opened a new issue #4674:
URL: https://github.com/apache/incubator-doris/issues/4674



   **Is your feature request related to a problem? Please describe.**
   It is a common scenario for a database to calculate top-n from a dataset, 
sort a column and return the Top elements.  TopN could uses an approximation 
algorithms to quickly return results with little overhead.
   
   There are many algorithms to quickly calculate frequent items from the data 
set, among which the **SpaceSaving** algorithm has a good performance for 
reference _[1]_. Kylin and PostgresSQL have implemented TopN approximation 
algorithms based on this algorithm and have performed well. We can implement 
this algorithm in Doris to compute TopN quickly.
   
   **Describe the solution you'd like**
   Kylin's current implementation computes the topN of a measure column based 
on dimensional columns, such like `group by dimension_cols order by 
measure_column limit 10` 
[reference](https://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/) 
 while PostgresSQL uses the topN function to computes the frequent entries of a 
column. [reference](https://github.com/citusdata/postgresql-topn) 
   
   For the current Doris, the lack of UDTF support makes it difficult to 
calculate topN on dimension columns and measure column like Kylin. So it is 
more appropriate to calculate topN using an aggregate function like 
PostgresSQL, and present the results in JSON format.
   
   eg:
   ```
   CREATE TABLE popular_ad
   (
     event_date date NOT NULL ,
     ad_id BIGINT NOY NULL
   );
   ```
   For example, we can calculate the ad_id that is displayed or clicked most 
per day.
   
   ```
   select event_date, topn(ad_id, 1)  as top_1 from popular_ad group by 
event_date;
   ```
   ```
                  event_date         |      top_1
   ----------------------------------+--------------------
   2020-01-01                        |  {"xxxx":"10000"}
   2020-01-02                        |  {"xxx":"11111"}
   ```
   
   The result represents the top1 item and its corresponding frequency.
   
   In the future, with Doris supporting UDTF, we  can display item and 
frequency based on this function, or topN for the SUM aggregate type column 
group by key columns
   
   
   **Additional context**
   [1] Efficient Computation of Frequent and Top-k Elements in Data Streams by 
Metwally, Agrawal, and Abbadi.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-doris] Youngwb opened a new issue #4674: Support approximate TopN function for Doris

Reply via email to