Stamatis Zampetakis created HIVE-28760:
------------------------------------------

             Summary: Support WITHIN GROUP in COLLECT_LIST aggregate function
                 Key: HIVE-28760
                 URL: https://issues.apache.org/jira/browse/HIVE-28760
             Project: Hive
          Issue Type: New Feature
          Components: SQL
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


The WITHIN GROUP clause lets aggregate functions operate on a sorted list of 
rows and its already supported by Hive in some aggregate functions (e.g. 
HIVE-16255). 

The goal of this ticket is to support the use of WITHIN GROUP along 
COLLECT_LIST aggregate function to be able to impose an order on the array 
returned by COLLECT_LIST.

This feature would be extremely useful for scenarios like the one described 
below.
{code:sql}
CREATE TABLE customer (id INT, name STRING, age INT, country STRING);

INSERT INTO customer VALUES 
(0, 'Bob', 22, 'GREECE'),
(1, 'Alice', 34, 'FRANCE'),
(2, 'Victor', 80, 'GERMANY'),
(3, 'Alicia', 55, 'GREECE'),
(4, 'Frank', 18, 'FRANCE'),
(5, 'Robin', 15, 'FRANCE'),
(6, 'Paul', 75, 'GERMANY');
{code}

_Group all customers by country and for each country collect all the customer 
names sorted by the age of each customer._

{code:sql}
SELECT country, COLLECT_LIST(name) WITHIN GROUP (ORDER BY age) as all_names
FROM customer
GROUP BY country;
{code}

{noformat}
+----------+----------------------------+
| country  |         all_names          |
+----------+----------------------------+
| GERMANY  | ["Paul","Victor"]          |
| FRANCE   | ["Robin","Frank","Alice"]  |
| GREECE   | ["Bob","Alicia"]           |
+----------+----------------------------+
{noformat}

There are a few other ways to achieve the same result currently in Hive (e.g. 
window functions, sort_arrray_by) but they are more cumbersome to write and 
depending how they are written less efficient. 




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to