kou commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2089902455
Thanks for sharing the idea we discussed.

I took a look at the DuckDB implementation. It seems that DuckDB uses only column-level statistics: `duckdb::TableFunction::statistics` returns the statistics of a specified column:

https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/function/table_function.hpp#L253-L255
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/function/table_function.hpp#L188-L189

`duckdb::BaseStatistics` doesn't have a row count. It has a distinct count, "has `NULL`", and "has non-`NULL`":

https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146

It seems that a numeric/string column can have min/max statistics:

https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/numeric_stats.hpp#L22-L31
https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/string_stats.hpp#L23-L36

(A string column can have more statistics, such as whether it has Unicode and its max length.)

Hmm. It seems that column-level statistics are also needed for real-world use cases.
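To make the shape of these column-level statistics concrete, here is a rough C++17 sketch. All type and function names (`ColumnStatistics`, `NumericMinMax`, `StringMinMax`, `CanSkipEquals`) are hypothetical and chosen only for illustration. This is not DuckDB's actual class layout and not a proposed Arrow API. It only mirrors the fields listed above, and every field is optional or conservative because a producer may not know it.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <variant>

// Hypothetical types for illustration only; not DuckDB's or Arrow's API.

// Min/max for a numeric column (int64_t is an assumption to keep this short).
struct NumericMinMax {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Min/max for a string column, plus the extra string-only statistics.
struct StringMinMax {
  std::optional<std::string> min;
  std::optional<std::string> max;
  std::optional<uint32_t> max_length;
  bool may_have_unicode = true;  // conservative default: unknown means "maybe"
};

// Per-column statistics. Note that there is no row count here.
struct ColumnStatistics {
  bool can_have_null = true;      // "has NULL"
  bool can_have_non_null = true;  // "has non-NULL"
  std::optional<uint64_t> distinct_count;
  // std::monostate means no type-specific statistics are available.
  std::variant<std::monostate, NumericMinMax, StringMinMax> type_stats;
};

// Returns true only when the statistics prove that no value in the column
// can equal `value`. Missing statistics always yield false.
inline bool CanSkipEquals(const ColumnStatistics& stats, int64_t value) {
  const auto* numeric = std::get_if<NumericMinMax>(&stats.type_stats);
  if (numeric == nullptr) return false;
  if (numeric->min && value < *numeric->min) return true;
  if (numeric->max && value > *numeric->max) return true;
  return false;
}
```

`CanSkipEquals` is just one example of how a consumer might use such statistics. A consumer that receives a `ColumnStatistics` with `min = 10` and `max = 20` could skip scanning that column for `value == 5`, while a column without statistics must always be scanned.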