dilipbiswal opened a new pull request #24759: [SPARK-27395][SQL][WIP] Improve EXPLAIN command
URL: https://github.com/apache/spark/pull/24759
 
 
   ## What changes were proposed in this pull request?
   This is a WIP PR that aims to improve the way physical plans are explained in Spark.
   
   Currently, the explain output for a physical plan can look very cluttered, and each operator's string representation can be very wide and wrap around in the display, making it hard to follow. This especially happens when explaining a query that 1) operates on wide tables or 2) contains complex expressions.
   
   This PR splits the output into two sections. The header section displays the basic operator tree, with a number assigned to each operator; in this section, we strictly control what is output for each operator. In the footer section, each operator is displayed verbosely. Based on feedback from Maryann, uncorrelated subqueries (SubqueryExecs) are not included in the main plan; they are printed separately after the main plan and can be correlated with it through the originating expression id in the parent plan. A minimal sketch of this layout follows.
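   In the sketch below, `PlanNode` and `ExplainSketch` are hypothetical toy types standing in for `SparkPlan` and the new formatter; the code only illustrates the idea (post-order numbering, so leaves get the smallest ids) and is not this PR's implementation. It can be pasted into a plain Scala REPL:
   ```scala
   import scala.collection.mutable.ArrayBuffer

   // Toy stand-in for SparkPlan: an operator name plus a verbose description.
   // (A plain class, so the Map below keys on reference identity.)
   class PlanNode(val name: String, val details: String,
                  val children: Seq[PlanNode] = Nil)

   object ExplainSketch {
     def explain(root: PlanNode): String = {
       // Number operators in post-order so leaves get the smallest ids,
       // matching "(1) Scan ... (8) Project" in the example output below.
       val ordered = ArrayBuffer.empty[PlanNode]
       def visit(n: PlanNode): Unit = { n.children.foreach(visit); ordered += n }
       visit(root)
       val ids = ordered.zipWithIndex.map { case (n, i) => (n, i + 1) }.toMap

       // Header section: the bare operator tree, nothing but name and id.
       // (The real output draws +- and : connectors; indentation suffices here.)
       def tree(n: PlanNode, indent: String): String =
         s"$indent${n.name} (${ids(n)})\n" +
           n.children.map(tree(_, indent + "   ")).mkString

       // Footer section: every operator displayed verbosely, in id order.
       val footer =
         ordered.map(n => s"(${ids(n)}) ${n.name}\n${n.details}\n").mkString("\n")

       tree(root, "") + "\n" + footer
     }
   }

   // Usage: a two-operator plan prints the header tree, then per-operator detail.
   val scan   = new PlanNode("Scan parquet default.explain_temp1", "Output: [key, val]")
   val filter = new PlanNode("Filter", "Condition : (key > 0)", Seq(scan))
   println(ExplainSketch.explain(filter))
   ```
   Running the usage lines prints a two-line header tree followed by the two numbered, verbose entries.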
   
   To illustrate, here is a simple plan displayed the old way vs. the new way.
   
   Example query 1:
   ```
   EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0
   ```
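   For anyone trying this locally, here is a plausible setup (paste into spark-shell). The schema is inferred from the `ReadSchema: struct<key:int,val:int>` in the scan node of the plan and may differ from the PR's actual test fixture:
   ```scala
   // Hypothetical setup inferred from the plan's ReadSchema; the PR's
   // actual test tables may be created differently.
   spark.sql("CREATE TABLE explain_temp1 (key INT, val INT) USING PARQUET")
   ```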
   
   Old:
   ```
   *(2) Project [key#2, max(val)#15]
   +- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
      +- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18])
         +- Exchange hashpartitioning(key#2, 200)
            +- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
               +- *(1) Project [key#2, val#3]
                  +- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
                     +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
   ```
   New:
   ```
   Project (8)
   +- Filter (7)
      +- HashAggregate (6)
         +- Exchange (5)
            +- HashAggregate (4)
               +- Project (3)
                  +- Filter (2)
                     +- Scan parquet default.explain_temp1 (1)
   
   
   (1) Scan parquet default.explain_temp1 [codegen id : 1]
   Output: [key#2, val#3]
   
   (2) Filter [codegen id : 1]
   Input     : [key#2, val#3]
   Condition : (isnotnull(key#2) AND (key#2 > 0))
   
   (3) Project [codegen id : 1]
   Output    : [key#2, val#3]
   Input     : [key#2, val#3]
   
   (4) HashAggregate [codegen id : 1]
   Input: [key#2, val#3]
   
   (5) Exchange
   Input: [key#2, max#11]
   
   (6) HashAggregate [codegen id : 2]
   Input: [key#2, max#11]
   
   (7) Filter [codegen id : 2]
   Input     : [key#2, max(val)#5, max(val#3)#8]
   Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0))
   
   (8) Project [codegen id : 2]
   Output    : [key#2, max(val)#5]
   Input     : [key#2, max(val)#5, max(val#3)#8]
   ```
   
   Example query 2 (with subqueries):
   ```
   SELECT * FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3
   ```
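   As before, plausible definitions for the two additional tables, inferred from the `ReadSchema` entries in the plan below (the PR's actual fixtures may differ):
   ```scala
   // Hypothetical setup; both scans report ReadSchema: struct<key:int,val:int>.
   spark.sql("CREATE TABLE explain_temp2 (key INT, val INT) USING PARQUET")
   spark.sql("CREATE TABLE explain_temp3 (key INT, val INT) USING PARQUET")
   ```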
   Old:
   ```
   *(1) Project [key#2, val#3]
   +- *(1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) AND (val#3 > 3))
      :  +- Subquery scalar-subquery#39
      :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45])
      :        +- Exchange SinglePartition
      :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47])
      :              +- *(1) Project [key#26]
      :                 +- *(1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2))
      :                    :  +- Subquery scalar-subquery#38
      :                    :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43])
      :                    :        +- Exchange SinglePartition
      :                    :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49])
      :                    :              +- *(1) Project [key#28]
      :                    :                 +- *(1) Filter (isnotnull(val#29) AND (val#29 > 0))
      :                    :                    +- *(1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int>
      :                    +- *(1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int>
      +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int>
   ```
   New:
   ```
   Project (3)
   +- Filter (2)
      +- Scan parquet default.explain_temp1 (1)
   
   
   (1) Scan parquet default.explain_temp1 [codegen id : 1]
   Output: [key#2, val#3]
   
   (2) Filter [codegen id : 1]
   Input     : [key#2, val#3]
   Condition : (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3))
   
   (3) Project [codegen id : 1]
   Output    : [key#2, val#3]
   Input     : [key#2, val#3]
   ===== Subqueries =====
   
   Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23
   HashAggregate (9)
   +- Exchange (8)
      +- HashAggregate (7)
         +- Project (6)
            +- Filter (5)
               +- Scan parquet default.explain_temp2 (4)
   
   
   (4) Scan parquet default.explain_temp2 [codegen id : 1]
   Output: [key#26, val#27]
   
   (5) Filter [codegen id : 1]
   Input     : [key#26, val#27]
   Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2))
   
   (6) Project [codegen id : 1]
   Output    : [key#26]
   Input     : [key#26, val#27]
   
   (7) HashAggregate [codegen id : 1]
   Input: [key#26]
   
   (8) Exchange
   Input: [max#35]
   
   (9) HashAggregate [codegen id : 2]
   Input: [max#35]
   
   Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22
   HashAggregate (15)
   +- Exchange (14)
      +- HashAggregate (13)
         +- Project (12)
            +- Filter (11)
               +- Scan parquet default.explain_temp3 (10)
   
   
   (10) Scan parquet default.explain_temp3 [codegen id : 1]
   Output: [key#28, val#29]
   
   (11) Filter [codegen id : 1]
   Input     : [key#28, val#29]
   Condition : (isnotnull(val#29) AND (val#29 > 0))
   
   (12) Project [codegen id : 1]
   Output    : [key#28]
   Input     : [key#28, val#29]
   
   (13) HashAggregate [codegen id : 1]
   Input: [key#28]
   
   (14) Exchange
   Input: [max#37]
   
   (15) HashAggregate [codegen id : 2]
   Input: [max#37]
   ```
   
   Note:
   I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow and will not be able to incorporate feedback immediately, but I will start working on it as soon as I can. Also, this PR currently provides only the basic infrastructure for the explain enhancement; the details for individual operators will be implemented in follow-up PRs.
   ## How was this patch tested?
   Added a new test `explain.sql` that covers the basic scenarios. More tests need to be added.
