kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2089902455

   Thanks for sharing the idea we discussed.
   
   I took a look at the DuckDB implementation. It seems that DuckDB uses only 
column-level statistics:
   
   `duckdb::TableFunction::statistics` returns the statistics of a specified 
column:
   
   
https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/function/table_function.hpp#L253-L255
   
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/function/table_function.hpp#L188-L189
   
   `duckdb::BaseStatistics` doesn't have a row count. It has a distinct count and 
flags for whether the column has `NULL` and non-`NULL` values:
   
   
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
   
   It seems that a numeric/string column can have min/max statistics:
   
   
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/numeric_stats.hpp#L22-L31
   
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/string_stats.hpp#L23-L36
   
   (A string column can have more statistics, such as whether it has Unicode and 
its max length.)
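
As a rough sketch of the shape described above, per-column statistics for an integer column could look like the following. All names here (`ColumnStatistics`, `ComputeInt64Stats`) are hypothetical and only for illustration. They are not DuckDB's or Arrow's actual types, and the real DuckDB classes are more involved. Note that there is no row count, only the per-column values:

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Hypothetical type for illustration only (not DuckDB's or Arrow's API).
// It mirrors the per-column information listed above: NULL / non-NULL
// flags, a distinct count, and min/max. There is no row count.
struct ColumnStatistics {
  bool has_null = false;     // at least one NULL value was seen
  bool has_no_null = false;  // at least one non-NULL value was seen
  std::optional<int64_t> distinct_count;
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Compute statistics for one int64 column. std::nullopt stands for NULL.
ColumnStatistics ComputeInt64Stats(
    const std::vector<std::optional<int64_t>>& values) {
  ColumnStatistics stats;
  std::set<int64_t> distinct;
  for (const auto& value : values) {
    if (!value) {
      stats.has_null = true;
      continue;
    }
    stats.has_no_null = true;
    distinct.insert(*value);
    if (!stats.min || *value < *stats.min) stats.min = *value;
    if (!stats.max || *value > *stats.max) stats.max = *value;
  }
  // Exact distinct count over non-NULL values; a real system may only
  // keep an estimate.
  stats.distinct_count = static_cast<int64_t>(distinct.size());
  return stats;
}
```

A string column would need its own min/max representation plus the extra fields mentioned above, so this sketch covers only the numeric case.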
   
   Hmm. It seems that column-level statistics are also needed for real-world use 
cases.


