Stamatis Zampetakis created HIVE-29257:
------------------------------------------
Summary: Remove OPTIMIZED SQL entry from EXPLAIN EXTENDED
Key: HIVE-29257
URL: https://issues.apache.org/jira/browse/HIVE-29257
Project: Hive
Issue Type: Task
Components: CBO
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis
The {{EXPLAIN EXTENDED}} statement among other things prints a serialization of
the CBO plan in the form of SQL (so called OPTIMIZED SQL). This feature was
introduced in HIVE-19360 for diagnosability purposes to assist in the
understanding of the DAG plan with respect to join ordering.
However, in most cases when developers want to understand the join order they
look into the CBO plan and not in the OPTIMIZED SQL output. Determining the
join order from SQL becomes challenging for queries that contain more than a
few tables and nested SQL constructs make the situation worse. The CBO RelNode
tree is usually easier to navigate.
In various cases the SQL serialization is an invalid SQL query. This led people
to believe that the CBO plan is wrong. As a consequence, they attempt to fix
the CBO plan while in fact the plan is perfectly valid. The bug usually lies
only in the serialization part but the latter does not have any effect on the
execution of the query.
Users (and sometimes developers) believe that the OPTIMIZED SQL is the query
that is actually run by Hive. There have been various cases where people
copy-paste the OPTIMIZED SQL entry and try to run it when they have issues with
the original query and when that fails as well it adds up to the confusion.
In addition, since the serializer is buggy it sometimes raises exceptions and
the OPTIMIZED SQL does not appear in the output. People will raise bug fixes
and invest effort in a feature that is not very useful.
Currently, there are 282 .q.out files that use {{EXPLAIN EXTENDED}} and thus
contain an entry of OPTIMIZED SQL. Even minor changes in the CBO plan do affect
the OPTIMIZED SQL necessitating updates and reviews on multiple files.
Typically calcite upgrades trigger many changes in OPTIMIZED SQL.
{noformat}
$ find . -name "*.q.out" -exec grep -a "OPTIMIZED SQL" {} \; | wc -l
820
$ find . -name "*.q.out" -exec grep -l "OPTIMIZED SQL" {} \; | wc -l
282
{noformat}
I propose to remove the {{OPTIMIZED SQL}} feature and related code to reduce:
* maintenance overhead
* plan/file (e.g., .q.out) size
* users/dev confusion about its meaning/usage
--
This message was sent by Atlassian Jira
(v8.20.10#820010)