Stamatis Zampetakis created HIVE-28760:
------------------------------------------
Summary: Support WITHIN GROUP in COLLECT_LIST aggregate function
Key: HIVE-28760
URL: https://issues.apache.org/jira/browse/HIVE-28760
Project: Hive
Issue Type: New Feature
Components: SQL
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
The WITHIN GROUP clause lets aggregate functions operate on a sorted list of
rows and its already supported by Hive in some aggregate functions (e.g.
HIVE-16255).
The goal of this ticket is to support the use of WITHIN GROUP along
COLLECT_LIST aggregate function to be able to impose an order on the array
returned by COLLECT_LIST.
This feature would be extremely useful for scenarios like the one described
below.
{code:sql}
CREATE TABLE customer (id INT, name STRING, age INT, country STRING);
INSERT INTO customer VALUES
(0, 'Bob', 22, 'GREECE'),
(1, 'Alice', 34, 'FRANCE'),
(2, 'Victor', 80, 'GERMANY'),
(3, 'Alicia', 55, 'GREECE'),
(4, 'Frank', 18, 'FRANCE'),
(5, 'Robin', 15, 'FRANCE'),
(6, 'Paul', 75, 'GERMANY');
{code}
_Group all customers by country and for each country collect all the customer
names sorted by the age of each customer._
{code:sql}
SELECT country, COLLECT_LIST(name) WITHIN GROUP (ORDER BY age) as all_names
FROM customer
GROUP BY country;
{code}
{noformat}
+----------+----------------------------+
| country | all_names |
+----------+----------------------------+
| GERMANY | ["Paul","Victor"] |
| FRANCE | ["Robin","Frank","Alice"] |
| GREECE | ["Bob","Alicia"] |
+----------+----------------------------+
{noformat}
There are a few other ways to achieve the same result currently in Hive (e.g.
window functions, sort_arrray_by) but they are more cumbersome to write and
depending how they are written less efficient.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)