Stamatis Zampetakis created HIVE-29257:
------------------------------------------

             Summary: Remove OPTIMIZED SQL entry from EXPLAIN EXTENDED
                 Key: HIVE-29257
                 URL: https://issues.apache.org/jira/browse/HIVE-29257
             Project: Hive
          Issue Type: Task
          Components: CBO
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


The {{EXPLAIN EXTENDED}} statement among other things prints a serialization of 
the CBO plan in the form of SQL (so called OPTIMIZED SQL). This feature was 
introduced in HIVE-19360 for diagnosability purposes to assist in the 
understanding of the DAG plan with respect to join ordering.

However, in most cases when developers want to understand the join order they 
look into the CBO plan and not in the OPTIMIZED SQL output. Determining the 
join order from SQL becomes challenging for queries that contain more than a 
few tables and nested SQL constructs make the situation worse. The CBO RelNode 
tree is usually easier to navigate.

In various cases the SQL serialization is an invalid SQL query. This led people 
to believe that the CBO plan is wrong. As a consequence, they attempt to fix 
the CBO plan while in fact the plan is perfectly valid. The bug usually lies 
only in the serialization part but the latter does not have any effect on the 
execution of the query.

Users (and sometimes developers) believe that the OPTIMIZED SQL is the query 
that is actually run by Hive. There have been various cases where people 
copy-paste the OPTIMIZED SQL entry and try to run it when they have issues with 
the original query and when that fails as well it adds up to the confusion.

In addition, since the serializer is buggy it sometimes raises exceptions and 
the OPTIMIZED SQL does not appear in the output. People will raise bug fixes 
and invest effort in a feature that is not very useful.

Currently, there are 282 .q.out files that use {{EXPLAIN EXTENDED}} and thus 
contain an entry of OPTIMIZED SQL. Even minor changes in the CBO plan do affect 
the OPTIMIZED SQL necessitating updates and reviews on multiple files. 
Typically calcite upgrades trigger many changes in OPTIMIZED SQL.

{noformat}
$ find . -name "*.q.out" -exec grep -a "OPTIMIZED SQL" {} \; | wc -l
820
$ find . -name "*.q.out" -exec grep -l "OPTIMIZED SQL" {} \; | wc -l
282
{noformat}

I propose to remove the {{OPTIMIZED SQL}} feature and related code to reduce:
* maintenance overhead
* plan/file (e.g., .q.out) size
* users/dev confusion about its meaning/usage




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to