Stamatis Zampetakis created HIVE-28675:
------------------------------------------
Summary: Maximize the removal of redundant columns from GROUP BY
clauses
Key: HIVE-28675
URL: https://issues.apache.org/jira/browse/HIVE-28675
Project: Hive
Issue Type: Improvement
Components: CBO
Affects Versions: 4.0.1
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
HIVE-17043 introduced some logic to remove redundant columns from the GROUP BY
when the latter contains unique keys. The removal is done in a heuristic manner
by considering the first unique key that is part of the GROUP BY. In the
presence of multiple keys, the heuristic removal is incomplete and sub-optimal.
Consider the following example in which we have a table with three unique keys.
{code:sql}
CREATE TABLE passenger
(
id INT NOT NULL,
fname STRING NOT NULL,
lname STRING NOT NULL,
passport STRING NOT NULL,
UNIQUE (id) DISABLE RELY,
UNIQUE (passport) DISABLE RELY,
UNIQUE (fname, lname) DISABLE RELY
)
{code}
+Example I+
In the following query, the passport column can be removed from the GROUP BY
clause since it is not used in the SELECT clause and the id column is a unique
key.
{code:sql}
EXPLAIN CBO SELECT id, COUNT(1) FROM passenger GROUP BY id, passport
{code}
The optimizer correctly applies this transformation and the query plan is shown
below.
{noformat}
CBO PLAN:
HiveAggregate(group=[{0}], agg#0=[count()])
HiveTableScan(table=[[default, passenger]], table:alias=[passenger])
{noformat}
In the following query, the id column can be removed from the GROUP BY clause
since it is not used in the SELECT clause and the passport column is a unique
key.
{code:sql}
EXPLAIN CBO SELECT passport, COUNT(1) FROM passenger GROUP BY id, passport
{code}
The optimizer fails to apply the removal and we can see the aggregate operator
is on both id (0-index) and passport (3-index).
{noformat}
CBO PLAN:
HiveProject(passport=[$1], _o__c1=[$2])
HiveAggregate(group=[{0, 3}], agg#0=[count()])
HiveTableScan(table=[[default, passenger]], table:alias=[passenger])
{noformat}
Although the two queries above have a similar structure and the same keys, the
removal logic fails to create the expected plan thus it is in a sense
incomplete.
+Example II+
In the following query, there various column combinations that can be removed
from the GROUP BY clause.
{code:sql}
EXPLAIN CBO SELECT passport COUNT(1) FROM passenger GROUP BY id, fname, lname,
passport
{code}
The optimizer picks to remove the fname and lname columns but this is not the
optimal choice; we can observe that the id column (0-index) could also be
removed.
{noformat}
CBO PLAN:
HiveProject(passport=[$1], _o__c1=[$2])
HiveAggregate(group=[{0, 3}], agg#0=[count()])
HiveTableScan(table=[[default, passenger]], table:alias=[passenger])
{noformat}
The optimizer has multiple options in terms of which columns to retain:
* (passport)
* (id, passport)
* (fname, passport)
* (lname, passport)
* (id, fname, passport)
* (id, lname, passport)
* (id, fname, lname, passport)
The seemingly best choice is to keep only the passport column and drop all the
rest.
The goal of this ticket is to maximize the number of columns that are removed
by choosing the most promising unique key to retain in the GROUP BY clause. By
exploring all unique keys we address the completeness issue and by maximizing
the removed columns we achieve optimality in the number of columns dropped.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)