chelubai created KYLIN-1467:
-------------------------------
Summary: subquery optimization: remove unused subquery output
columns
Key: KYLIN-1467
URL: https://issues.apache.org/jira/browse/KYLIN-1467
Project: Kylin
Issue Type: Improvement
Components: Query Engine
Affects Versions: v1.2
Reporter: chelubai
Assignee: liyang
Priority: Minor
Table 'tb' has more than 10,000,000 rows;
-- sql-1:
select a from tb group by a;
The sql-1 result has only 3 rows:
v1
v2
v3
When query kylin through BI tool tableau, the generated sql:
-- sql-2:
SELECT "X___SQL___"."a" AS "a"
FROM (
select a,b,c,d
from tb
group by a,b,c,d )
"X___SQL___"
GROUP BY "X___SQL___"."a"
the sql-1 and sql-2 is equivalent, but sql-2 fails with message:
Scan row count exceeded threshold: 1000000, please add filter condition to
narrow down backend scan range, like where clause.
one solution:
Add a calcite planner optimization rule to kylin-query, to remove the columns
of the subquery that is not used outside, both from select, group by and order
by.
so the sql-2 may be optimized to:
SELECT "X___SQL___"."a" AS "a"
FROM (
select a
from tb
group by a )
"X___SQL___"
GROUP BY "X___SQL___"."a"
This sql run successfully.
Donot know if the exiting products like mysql/oracle/hive/presto/spark... have
this optimization or not.
Any comment is appreciated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)