Siddharth Seth created HIVE-15122:
-------------------------------------

             Summary: Hive: Upcasting types should not obscure stats (min/max/ndv)
                 Key: HIVE-15122
                 URL: https://issues.apache.org/jira/browse/HIVE-15122
             Project: Hive
          Issue Type: Bug
            Reporter: Siddharth Seth


A UDFToLong cast on a join key breaks PK/FK inference and triggers
mis-estimation of joins in LLAP.
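
To illustrate why an upcast need not obscure column statistics, here is a minimal sketch. The `ColumnStats` class and the `widen_int_to_long` helper are hypothetical names chosen for illustration, not Hive APIs, and the statistic values are made up. The only point is that an int-to-bigint cast is lossless and order-preserving, so min, max and NDV could carry over unchanged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container for the statistics named in the summary."""
    type_name: str
    min_value: int
    max_value: int
    ndv: int  # number of distinct values


def widen_int_to_long(stats: ColumnStats) -> ColumnStats:
    """Return stats for an int column cast to bigint.

    Every int maps to a distinct bigint with the same numeric value,
    so only the type name changes; min, max and ndv stay the same.
    """
    if stats.type_name != "int":
        raise ValueError("this sketch only handles int -> bigint widening")
    return replace(stats, type_name="bigint")


# Illustrative values only; they are not taken from the plan.
l_suppkey = ColumnStats(type_name="int", min_value=1,
                        max_value=10_000_000, ndv=10_000_000)
widened = widen_int_to_long(l_suppkey)
```

A planner that treated the cast this way could still compare the widened key against a bigint primary key using the original column's statistics.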

Snippet from the bad plan.
{code}
STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20161031222730_a700058f-78eb-40d6-a67d-43add60a50e2:6
      Edges:
        Map 2 <- Map 1 (BROADCAST_EDGE)
        Map 3 <- Map 2 (BROADCAST_EDGE)
        Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE), Map 7 (CUSTOM_SIMPLE_EDGE), Map 8 (BROADCAST_EDGE), Map 9 (BROADCAST_EDGE)
        Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
        Reducer 6 <- Reducer 5 (SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: supplier
                  filterExpr: (s_suppkey is not null and s_nationkey is not null) (type: boolean)
                  Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (s_suppkey is not null and s_nationkey is not null) (type: boolean)
                    Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: s_suppkey (type: bigint), s_nationkey (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: bigint)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 2
            Map Operator Tree:
                TableScan
                  alias: lineitem
                  filterExpr: (l_suppkey is not null and l_orderkey is not null) (type: boolean)
                  Statistics: Num rows: 2285121364 Data size: 63983407882 Basic stats: COMPLETE Column stats: PARTIAL
                  Filter Operator
                    predicate: (l_suppkey is not null and l_orderkey is not null) (type: boolean)
                    Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL
                    Select Operator
                      expressions: l_orderkey (type: bigint), l_suppkey (type: int), l_extendedprice (type: double), l_discount (type: double), l_shipdate (type: date)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4
                      Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 UDFToLong(_col1) (type: bigint)
                        outputColumnNames: _col1, _col2, _col4, _col5, _col6
                        input vertices:
                          0 Map 1
                        Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL
                        Reduce Output Operator
                          key expressions: _col2 (type: bigint)
                          sort order: +
                          Map-reduce partition columns: _col2 (type: bigint)
                          Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL
                          value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 3
            Map Operator Tree:
                TableScan
                  alias: orders
                  filterExpr: (o_orderkey is not null and o_custkey is not null) (type: boolean)
                  Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (o_orderkey is not null and o_custkey is not null) (type: boolean)
                    Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: o_orderkey (type: int), o_custkey (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col2 (type: bigint)
                          1 UDFToLong(_col0) (type: bigint)
                        outputColumnNames: _col1, _col4, _col5, _col6, _col8
                        input vertices:
                          0 Map 2
                        Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col8 (type: bigint)
                          sort order: +
                          Map-reduce partition columns: _col8 (type: bigint)
                          Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 7
{code}
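Note the Map 2 to Map 3 output: the map join in Map 2 is estimated at
10000000 rows (880000000 bytes), although the lineitem scan feeding it is
estimated at 2285121364 rows.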

This causes a rather large join (120GB) to be categorized as a map-join.
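
For comparison, the sketch below applies a textbook equi-join cardinality estimate, rows(L) * rows(R) / max(ndv(L), ndv(R)). This is not necessarily the formula Hive uses, and `estimate_join_rows` is a hypothetical helper. The row counts come from the plan above. The NDV of 10000000 for both join keys is an assumption, since the plan does not print NDV values.

```python
def estimate_join_rows(left_rows: int, right_rows: int,
                       left_ndv: int, right_ndv: int) -> int:
    """Textbook equi-join estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    return left_rows * right_rows // max(left_ndv, right_ndv)


supplier_rows = 10_000_000       # Map 1 TableScan, from the plan
lineitem_rows = 2_285_121_364    # Map 2 TableScan, from the plan
assumed_key_ndv = 10_000_000     # assumption: one distinct key per supplier

estimated = estimate_join_rows(supplier_rows, lineitem_rows,
                               assumed_key_ndv, assumed_key_ndv)
print(estimated)
```

Under these assumptions the estimate equals the lineitem row count, which is what a foreign-key join to supplier should produce, and is far larger than the 10000000 rows reported for the Map 2 join.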



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
