NGA-TRAN commented on issue #8099: URL: https://github.com/apache/arrow-datafusion/issues/8099#issuecomment-1806419419
I used the change in this PR https://github.com/apache/arrow-datafusion/pull/8112/ to see stats of columns. Summary: 1. Lowest scan step: column only has stats `Null`. No stats on min and max. 2. `SortExec` keeps stats as its input 3. `FilterExec` on `string` and `timestamp` lose stats on everything including table-level stats 4. `FilterExec` on `integer` and `float` make: (1) table-level stats become `inexact`, (2) column stats of columns in the filter start having `Inexact` min and max stats, (3) column stats of other columns start having `Exact` but wrong stats 5. `AggregateExec` make: (1) row count become `Inexact`, (2) lost stats for bytes and columns ```SQL ❯ set datafusion.explain.show_statistics = true; 0 rows in set. Query took 0.009 seconds. ❯ set datafusion.execution.collect_statistics = true; 0 rows in set. Query took 0.001 seconds. ❯ create table t1(state string, city string, min_temp float, area int, time timestamp) as values ('MA', 'Boston', 70.4, 1, 50), ('MA', 'Bedford', 71.59, 2, 150), ('CA', 'SF', 79.0, 1, 300), ('MA', 'Boston', 75.4, 3, 250), ('MA', 'Andover', 69.5, 4, 250), ('MA', 'Bedford', 78.2, 2, 150), ('MA', 'Boston', 65.0, 2, 250), ('CA', 'SJ', 78.4, 1, 300), ('MA', 'Reading', 53.0, 4, 250), ('CA', 'SJ', 75.4, 5, 350); 0 rows in set. Query took 0.013 seconds. ❯ explain select * from t1; +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | TableScan: t1 projection=[state, city, min_temp, area, time] | | physical_plan | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] | | | | +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.004 seconds. ❯ explain select * from t1 order by time DESC; +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: t1.time DESC NULLS FIRST | | | TableScan: t1 projection=[state, city, min_temp, area, time] | | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] | | | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] | | | | +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.008 seconds. ❯ explain select * from t1 where time > to_timestamp(350) order by time DESC; +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: t1.time DESC NULLS FIRST | | | Filter: t1.time > TimestampNanosecond(350000000000, None) | | | TableScan: t1 projection=[state, city, min_temp, area, time] | | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] | | | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] | | | FilterExec: time@4 > 350000000000, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] | | | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] | | | | +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.006 seconds. ❯ explain select * from t1 where city = 'Boston' order by time DESC; +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: t1.time DESC NULLS FIRST | | | Filter: t1.city = Utf8("Boston") | | | TableScan: t1 projection=[state, city, min_temp, area, time] | | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] | | | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] | | | FilterExec: city@1 = Boston, statistics=[Rows=Absent, Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:)]] | | | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] | | | | +---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.006 seconds. ❯ explain select * from t1 where min_temp > 10.0 order by time DESC; +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: t1.time DESC NULLS FIRST | | | Filter: CAST(t1.min_temp AS Float64) > Float64(10) | | | TableScan: t1 projection=[state, city, min_temp, area, time] | | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] | | | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] | | | FilterExec: CAST(min_temp@2 AS Float64) > 10, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Inexact(Float32(10.000001)) Max=Inexact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Exact(Int32(NULL)) Max=Exact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] | | | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] | | | | +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.009 seconds. ❯ explain select * from t1 where area > 10 order by time DESC; +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Sort: t1.time DESC NULLS FIRST | | | Filter: t1.area > Int32(10) | | | TableScan: t1 projection=[state, city, min_temp, area, time] | | physical_plan | SortExec: expr=[time@4 DESC], statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] | | | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] | | | FilterExec: area@3 > 10, statistics=[Rows=Inexact(10), Bytes=Inexact(2960), [(Col[0]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[1]: Min=Exact(Utf8(NULL)) Max=Exact(Utf8(NULL)) Null=Inexact(0)),(Col[2]: Min=Exact(Float32(NULL)) Max=Exact(Float32(NULL)) Null=Inexact(0)),(Col[3]: Min=Inexact(Int32(11)) Max=Inexact(Int32(NULL)) Null=Inexact(0)),(Col[4]: Min=Exact(TimestampNanosecond(NULL, None)) Max=Exact(TimestampNanosecond(NULL, None)) Null=Inexact(0))]] | | | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0)),(Col[2]: Null=Exact(0)),(Col[3]: Null=Exact(0)),(Col[4]: Null=Exact(0))]] | | | | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.009 seconds. ❯ explain select city, min(min_temp) from t1 group by city; +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[t1.city]], aggr=[[MIN(t1.min_temp)]] | | | TableScan: t1 projection=[city, min_temp] | | physical_plan | AggregateExec: mode=FinalPartitioned, gby=[city@0 as city], aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] | | | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] | | | RepartitionExec: partitioning=Hash([city@0], 10), input_partitions=10, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] | | | RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1, statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] | | | AggregateExec: mode=Partial, gby=[city@0 as city], aggr=[MIN(t1.min_temp)], statistics=[Rows=Inexact(10), Bytes=Absent, [(Col[0]:),(Col[1]:)]] | | | MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960), [(Col[0]: Null=Exact(0)),(Col[1]: Null=Exact(0))]] | | | | +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set. Query took 0.007 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
