NGA-TRAN commented on issue #8099:
URL: 
https://github.com/apache/arrow-datafusion/issues/8099#issuecomment-1806419419

   I used the change in this PR 
https://github.com/apache/arrow-datafusion/pull/8112/ to see stats of columns. 
   Summary:
     1. Lowest scan step: column only has stats `Null`. No stats on min and max.
     2. `SortExec` keeps stats as its input
     3. `FilterExec` on `string` and `timestamp` lose stats on everything 
including table-level stats
     4. `FilterExec` on `integer` and `float` make: (1) table-level stats  
become `inexact`, (2) column stats of columns in the filter start having 
`Inexact` min and max stats, (3) column stats of other columns start having 
`Exact` but wrong stats
     5. `AggregateExec` make: (1) row count become `Inexact`, (2) lost stats 
for bytes and columns
     
   
   
   ```SQL
   ❯ set datafusion.explain.show_statistics = true;
   0 rows in set. Query took 0.009 seconds.
   
   ❯ set datafusion.execution.collect_statistics = true;
   0 rows in set. Query took 0.001 seconds.
   
   ❯ 
   create table t1(state string, city string, min_temp float, area int, time 
timestamp) as values 
       ('MA', 'Boston', 70.4, 1, 50),
       ('MA', 'Bedford', 71.59, 2, 150),
       ('CA', 'SF', 79.0, 1, 300),
       ('MA', 'Boston', 75.4, 3, 250),
       ('MA', 'Andover', 69.5, 4, 250),
       ('MA', 'Bedford', 78.2, 2, 150),
       ('MA', 'Boston', 65.0, 2, 250),
       ('CA', 'SJ', 78.4, 1, 300),
       ('MA', 'Reading', 53.0, 4, 250),
       ('CA', 'SJ', 75.4, 5, 350);
   0 rows in set. Query took 0.013 seconds.
   
   ❯ explain select * from t1;
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                              |
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | TableScan: t1 projection=[state, city, min_temp, area, 
time]                                                                           
                                                                                
  |
   | physical_plan | MemoryExec: partitions=1, partition_sizes=[1], 
statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: 
Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: 
Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
   |               |                                                            
                                                                                
                                                                              |
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.004 seconds.
   
   ❯ explain select * from t1 order by time DESC;
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
|
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: t1.time DESC NULLS FIRST                             
                                                                                
                                                                                
|
   |               |   TableScan: t1 projection=[state, city, min_temp, area, 
time]                                                                           
                                                                                
  |
   | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Exact(10), 
Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: 
Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]]                
    |
   |               |   MemoryExec: partitions=1, partition_sizes=[1], 
statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: 
Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: 
Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
   |               |                                                            
                                                                                
                                                                                
|
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.008 seconds.
   
   ❯ explain select * from t1 where time > to_timestamp(350) order by time DESC;
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
    |
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: t1.time DESC NULLS FIRST                             
                                                                                
                                                                                
    |
   |               |   Filter: t1.time > TimestampNanosecond(350000000000, 
None)                                                                           
                                                                                
         |
   |               |     TableScan: t1 projection=[state, city, min_temp, area, 
time]                                                                           
                                                                                
    |
   | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, 
Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]              
                                                                                
        |
   |               |   CoalesceBatchesExec: target_batch_size=8192, 
statistics=[Rows=Absent, Bytes=Absent, 
[(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                            
                                                         |
   |               |     FilterExec: time@4 > 350000000000, 
statistics=[Rows=Absent, Bytes=Absent, 
[(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                            
                                                                 |
   |               |       MemoryExec: partitions=1, partition_sizes=[1], 
statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: 
Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: 
Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
   |               |                                                            
                                                                                
                                                                                
    |
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.006 seconds.
   
   ❯ explain select * from t1 where city = 'Boston' order by time DESC;
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
    |
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: t1.time DESC NULLS FIRST                             
                                                                                
                                                                                
    |
   |               |   Filter: t1.city = Utf8("Boston")                         
                                                                                
                                                                                
    |
   |               |     TableScan: t1 projection=[state, city, min_temp, area, 
time]                                                                           
                                                                                
    |
   | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, 
Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]              
                                                                                
        |
   |               |   CoalesceBatchesExec: target_batch_size=8192, 
statistics=[Rows=Absent, Bytes=Absent, 
[(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]                            
                                                         |
   |               |     FilterExec: city@1 = Boston, statistics=[Rows=Absent, 
Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]]              
                                                                                
     |
   |               |       MemoryExec: partitions=1, partition_sizes=[1], 
statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: 
Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: 
Null=Exact(0)),(Col[4]: Null=Exact(0))]] |
   |               |                                                            
                                                                                
                                                                                
    |
   
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.006 seconds.
   
   ❯ explain select * from t1 where min_temp > 10.0 order by time DESC;
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                      |
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: t1.time DESC NULLS FIRST                             
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                      |
   |               |   Filter: CAST(t1.min_temp AS Float64) > Float64(10)       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                      |
   |               |     TableScan: t1 projection=[state, city, min_temp, area, 
time]                                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                      |
   | physical_plan | SortExec: expr=[time@4 DESC], 
statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: 
Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) 
Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) 
Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) 
Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]]                   
  |
   |               |   CoalesceBatchesExec: target_batch_size=8192, 
statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: 
Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) 
Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) 
Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) 
Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]]    |
   |               |     FilterExec: CAST(min_temp@2 AS Float64) > 10, 
statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: 
Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) 
Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) 
Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) 
Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] |
   |               |       MemoryExec: partitions=1, partition_sizes=[1], 
statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: 
Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: 
Null=Exact(0)),(Col[4]: Null=Exact(0))]]                                        
                                                                                
                                                                                
                                                                                
           |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                      |
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.009 seconds.
   
   ❯ explain select * from t1 where area > 10 order by time DESC;
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                            |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Sort: t1.time DESC NULLS FIRST                             
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                            |
   |               |   Filter: t1.area > Int32(10)                              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                            |
   |               |     TableScan: t1 projection=[state, city, min_temp, area, 
time]                                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                                            |
   | physical_plan | SortExec: expr=[time@4 DESC], 
statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: 
Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: 
Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: 
Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, 
None)) Null=Inexact(0))]]                  |
   |               |   CoalesceBatchesExec: target_batch_size=8192, 
statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: 
Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: 
Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: 
Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: 
Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, 
None)) Null=Inexact(0))]] |
   |               |     FilterExec: area@3 > 10, statistics=[Rows=Inexact(10), 
Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) 
Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) 
Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) 
Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) 
Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) 
Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]]                   
|
   |               |       MemoryExec: partitions=1, partition_sizes=[1], 
statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: 
Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: 
Null=Exact(0)),(Col[4]: Null=Exact(0))]]                                        
                                                                                
                                                                                
                                                                                
 |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                            |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.009 seconds.
   
   ❯  explain select city, min(min_temp) from t1 group by city;
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Aggregate: groupBy=[[t1.city]], aggr=[[MIN(t1.min_temp)]]  
                                                                                
                |
   |               |   TableScan: t1 projection=[city, min_temp]                
                                                                                
                |
   | physical_plan | AggregateExec: mode=FinalPartitioned, gby=[city@0 as 
city], aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, 
[(Col[0]:),(Col[1]:)]]    |
   |               |   CoalesceBatchesExec: target_batch_size=8192, 
statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]]              
                            |
   |               |     RepartitionExec: partitioning=Hash([city@0], 10), 
input_partitions=10, statistics=[Rows=Inexact(10), Bytes=Absent, 
[(Col[0]:),(Col[1]:)]]              |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(10), 
input_partitions=1, statistics=[Rows=Inexact(10), Bytes=Absent, 
[(Col[0]:),(Col[1]:)]]            |
   |               |         AggregateExec: mode=Partial, gby=[city@0 as city], 
aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, 
[(Col[0]:),(Col[1]:)]]     |
   |               |           MemoryExec: partitions=1, partition_sizes=[1], 
statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: 
Null=Exact(0)),(Col[1]: Null=Exact(0))]] |
   |               |                                                            
                                                                                
                |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.007 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to