Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20692
  
    Thanks @gatorsmile and @rdblue for your discussion, and thanks @gatorsmile for trusting me. I hope I will be worthy of your trust.
    
    I agree that since we are not close to a new release, taking a thorough approach is better.
    
    Here is my analysis; let me know if anything else is needed. Basically, no other DB shows the data types of columns in its EXPLAIN output. I won't include the output for every DB in this description, since DB2's in particular is very verbose.
    Anyway, I confirm that DB2's output matches what is shown in the article posted by @gatorsmile.
    Postgres examples can be found here: https://www.postgresql.org/docs/10/static/using-explain.html.
    Oracle's are here: https://docs.oracle.com/cd/B10501_01/server.920/a96540/statements_911a.htm#2061798.
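    For quick reference, the linked plans are produced with each system's own EXPLAIN front end; a rough sketch of the invocations (the t1/t2 join below is illustrative, not taken from the linked pages):
    ```
    -- Postgres: EXPLAIN is prefixed directly to the statement
    -- (EXPLAIN ANALYZE additionally executes it and reports actual timings).
    EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;

    -- Oracle: EXPLAIN PLAN writes into the plan table, which is then
    -- rendered with DBMS_XPLAN.
    EXPLAIN PLAN FOR SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- DB2: EXPLAIN PLAN FOR populates the explain tables; the verbose
    -- output is then formatted with the db2exfmt command-line tool.
    EXPLAIN PLAN FOR SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;
    ```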
    MySQL is not very different either; here is a JSON representation of an example:
    ```
    {
      "query_block": {
        "select_id": 1,
        "nested_loop": [
          {
            "table": {
              "table_name": "u",
              "access_type": "ALL",
              "possible_keys": [
                "PRIMARY"
              ],
              "rows": 4,
              "filtered": 100
            }
          },
          {
            "table": {
              "table_name": "v",
              "access_type": "ref",
              "possible_keys": [
                "PRIMARY"
              ],
              "key": "PRIMARY",
              "used_key_parts": [
                "username"
              ],
              "key_length": "258",
              "ref": [
                "DBNAME.u.username"
              ],
              "rows": 1,
              "filtered": 100,
              "using_index": true
            }
          }
        ]
      }
    }
    ```
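    (JSON output like the above comes from MySQL's `EXPLAIN FORMAT=JSON`; judging from the plan, the underlying query was presumably a join of `u` and `v` on `username`, along these lines:)
    ```
    -- FORMAT=JSON asks MySQL for the JSON plan representation shown above;
    -- the exact query here is an assumption reconstructed from that plan.
    EXPLAIN FORMAT=JSON
    SELECT * FROM u JOIN v ON u.username = v.username;
    ```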
    Here is SQL Server's output (quite similar to the others too). Since the result is a very wide table, each plan row is shown below with its remaining columns listed as key-value pairs under its StmtText:
    ```
    select * from t1 join t2 on t1.a=t2.a;
        StmtId: 1, NodeId: 1, Parent: 0, PhysicalOp: NULL, LogicalOp: NULL,
        Argument: 1, DefinedValues: NULL, EstimateRows: 1.0, EstimateIO: NULL,
        EstimateCPU: NULL, AvgRowSize: NULL, TotalSubtreeCost: 2.4343444E-2,
        OutputList: NULL, Warnings: NULL, Type: SELECT, Parallel: 0,
        EstimateExecutions: NULL

      |--Hash Match(Inner Join, HASH:([master].[dbo].[t2].[a])=([master].[dbo].[t1].[a]), RESIDUAL:([master].[dbo].[t2].[a]=[master].[dbo].[t1].[a]))
        StmtId: 1, NodeId: 2, Parent: 1, PhysicalOp: Hash Match,
        LogicalOp: Inner Join,
        Argument: HASH:([master].[dbo].[t2].[a])=([master].[dbo].[t1].[a]), RESIDUAL:([master].[dbo].[t2].[a]=[master].[dbo].[t1].[a]),
        DefinedValues: NULL, EstimateRows: 1.0, EstimateIO: 0.0,
        EstimateCPU: 1.7774245E-2, AvgRowSize: 23, TotalSubtreeCost: 2.4343444E-2,
        OutputList: [master].[dbo].[t1].[a], [master].[dbo].[t1].[b], [master].[dbo].[t2].[a], [master].[dbo].[t2].[b],
        Warnings: NULL, Type: PLAN_ROW, Parallel: 0, EstimateExecutions: 1.0

           |--Table Scan(OBJECT:([master].[dbo].[t2]))
        StmtId: 1, NodeId: 3, Parent: 2, PhysicalOp: Table Scan,
        LogicalOp: Table Scan, Argument: OBJECT:([master].[dbo].[t2]),
        DefinedValues: [master].[dbo].[t2].[a], [master].[dbo].[t2].[b],
        EstimateRows: 1.0, EstimateIO: 0.003125, EstimateCPU: 0.0001581,
        AvgRowSize: 16, TotalSubtreeCost: 0.0032831,
        OutputList: [master].[dbo].[t2].[a], [master].[dbo].[t2].[b],
        Warnings: NULL, Type: PLAN_ROW, Parallel: 0, EstimateExecutions: 1.0

           |--Table Scan(OBJECT:([master].[dbo].[t1]))
        StmtId: 1, NodeId: 4, Parent: 2, PhysicalOp: Table Scan,
        LogicalOp: Table Scan, Argument: OBJECT:([master].[dbo].[t1]),
        DefinedValues: [master].[dbo].[t1].[a], [master].[dbo].[t1].[b],
        EstimateRows: 1.0, EstimateIO: 0.003125, EstimateCPU: 0.0001581,
        AvgRowSize: 16, TotalSubtreeCost: 0.0032831,
        OutputList: [master].[dbo].[t1].[a], [master].[dbo].[t1].[b],
        Warnings: NULL, Type: PLAN_ROW, Parallel: 0, EstimateExecutions: 1.0
    ```
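    (Output in this shape is what SQL Server returns with the session-level showplan setting; a minimal sketch, assuming the same t1/t2 tables and the SHOWPLAN_ALL mode:)
    ```
    -- With SHOWPLAN_ALL on, the statement is not executed; the server
    -- returns the estimated plan rows shown above instead.
    SET SHOWPLAN_ALL ON;
    GO
    select * from t1 join t2 on t1.a=t2.a;
    GO
    SET SHOWPLAN_ALL OFF;
    GO
    ```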
    
    The only one that shows data type information is Hive. Hive's EXPLAIN actually shows a lot of information for every operation, including the data type of every projected expression:
    ```
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1

    STAGE PLANS:
      Stage: Stage-1
        Tez
          Edges:
            Reducer 2 <- Map 1 (SIMPLE_EDGE)
            Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
          DagName: root_20180302103255_e34d204d-6a8d-46b0-9932-ff7e0b024493:1
          Vertices:
            Map 1 
                Map Operator Tree:
                    TableScan
                      alias: store_sales_orc
                      Statistics: Num rows: 1000 Data size: 88276 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: ss_sold_date_sk (type: int), ss_quantity (type: int)
                        outputColumnNames: ss_sold_date_sk, ss_quantity
                        Statistics: Num rows: 1000 Data size: 88276 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          aggregations: count(ss_quantity)
                          keys: ss_sold_date_sk (type: int)
                          mode: hash
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 1000 Data size: 88276 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 1000 Data size: 88276 Basic stats: COMPLETE Column stats: NONE
                            value expressions: _col1 (type: bigint)
                Execution mode: vectorized
            Reducer 2 
                Reduce Operator Tree:
                  Group By Operator
                    aggregations: count(VALUE._col0)
                    keys: KEY._col0 (type: int)
                    mode: mergepartial
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 500 Data size: 44138 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col1 (type: bigint)
                      sort order: -
                      Statistics: Num rows: 500 Data size: 44138 Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col0 (type: int)
                Execution mode: vectorized
            Reducer 3 
                Reduce Operator Tree:
                  Select Operator
                    expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 (type: bigint)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 500 Data size: 44138 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 500 Data size: 44138 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                Execution mode: vectorized

      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink
    ```
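    (For context: judging from the operator tree above, which scans store_sales_orc, counts ss_quantity grouped by ss_sold_date_sk, then sorts descending on the count, the query behind this plan was presumably something like:)
    ```
    -- Reconstructed from the plan above; the exact query is an assumption.
    EXPLAIN
    SELECT ss_sold_date_sk, count(ss_quantity) AS cnt
    FROM store_sales_orc
    GROUP BY ss_sold_date_sk
    ORDER BY cnt DESC;
    ```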
    
    What do you think? What should Spark's behavior be, in your opinion?

