Mostafa Mokhtar created IMPALA-5851:
---------------------------------------

             Summary: Estimate number of rows for sum_init_zero scans should be number of files not table cardinality
                 Key: IMPALA-5851
                 URL: https://issues.apache.org/jira/browse/IMPALA-5851
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: Mostafa Mokhtar
            Priority: Minor


IMPALA-5036 introduced an optimization to use the data stored in the Parquet
RowGroup.num_rows field for count(*) queries.
However, the estimated cardinality for the scan is the number of rows in the
base table, as opposed to the number of files or row groups.

{code}
+-------------------------------------------------------------------------------+
| Explain String                                                                |
+-------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B                                  |
| Per-Host Resource Estimates: Memory=108.00MB                                  |
|                                                                               |
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                         |
| |  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B                |
| PLAN-ROOT SINK                                                                |
| |  mem-estimate=0B mem-reservation=0B                                         |
| |                                                                             |
| 03:AGGREGATE [FINALIZE]                                                       |
| |  output: count:merge(*)                                                     |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                |
| |  tuple-ids=1 row-size=8B cardinality=1                                      |
| |                                                                             |
| 02:EXCHANGE [UNPARTITIONED]                                                   |
| |  mem-estimate=0B mem-reservation=0B                                         |
| |  tuple-ids=1 row-size=8B cardinality=1                                      |
| |                                                                             |
| F00:PLAN FRAGMENT [RANDOM] hosts=130 instances=130                            |
| Per-Host Resources: mem-estimate=98.00MB mem-reservation=0B                   |
| 01:AGGREGATE                                                                  |
| |  output: sum_init_zero(tpch_30000_parquet.lineitem.parquet-stats: num_rows) |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                |
| |  tuple-ids=1 row-size=8B cardinality=1                                      |
| |                                                                             |
| 00:SCAN HDFS [tpch_30000_parquet.lineitem, RANDOM]                            |
|    partitions=2526/2526 files=28976 size=6.89TB                               |
|    stats-rows=179999978268 extrapolated-rows=disabled                         |
|    table stats: rows=179999978268 size=unavailable                            |
|    column stats: all                                                          |
|    mem-estimate=88.00MB mem-reservation=0B                                    |
|    tuple-ids=0 row-size=8B cardinality=179999978268                           |
+-------------------------------------------------------------------------------+
{code}

{code}
+--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                      |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
| 03:AGGREGATE | 1      | 1.28ms   | 1.28ms   | 1      | 1          | 532.00 KB | 10.00 MB      | FINALIZE                    |
| 02:EXCHANGE  | 1      | 2.56s    | 2.56s    | 129    | 1          | 0 B       | 0 B           | UNPARTITIONED               |
| 01:AGGREGATE | 129    | 4.89ms   | 62.84ms  | 129    | 1          | 20.00 KB  | 10.00 MB      |                             |
| 00:SCAN HDFS | 129    | 62.44ms  | 341.03ms | 28.98K | 180.00B    | 1.75 MB   | 88.00 MB      | tpch_30000_parquet.lineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
{code}
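
The gap between the two outputs above (files=28976 and #Rows=28.98K on one
side, cardinality=179999978268 and Est. #Rows=180.00B on the other) can be
illustrated with a small sketch of the estimate this issue asks for. This is
not Impala's planner code: ScanStats, estimate_scan_cardinality and the
uses_parquet_count_star flag are hypothetical names, and the "one row per
file" rule is an assumption taken from the summary and from the observed
#Rows of the scan.

```python
from dataclasses import dataclass


@dataclass
class ScanStats:
    """Hypothetical container for the numbers shown on one SCAN HDFS node."""
    table_rows: int                # stats-rows from table statistics
    num_files: int                 # files= on the scan node
    uses_parquet_count_star: bool  # scan only reads RowGroup.num_rows


def estimate_scan_cardinality(stats: ScanStats) -> int:
    """Return the estimated number of rows the scan node emits."""
    if stats.uses_parquet_count_star:
        # Assumption: the optimized scan emits about one row per file,
        # each carrying the summed num_rows of that file's row groups.
        return stats.num_files
    # Regular scans keep the table-level estimate.
    return stats.table_rows


# Numbers copied from the explain output in this report.
lineitem = ScanStats(
    table_rows=179_999_978_268,
    num_files=28_976,
    uses_parquet_count_star=True,
)

current_estimate = lineitem.table_rows
proposed_estimate = estimate_scan_cardinality(lineitem)
overestimate_factor = current_estimate // proposed_estimate

print("current estimate: ", current_estimate)
print("proposed estimate:", proposed_estimate)
print("overestimate factor:", overestimate_factor)
```

With the figures from this report, the proposed estimate equals the file
count, which matches the 28.98K rows the scan actually produced. An estimate
based on row groups would be a possible variant, but the report does not give
a row-group count.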



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
