I believe DuckDB has their own custom parquet implementation[1].

[1]: https://github.com/duckdb/duckdb/blob/26cb7178fd89f924a936874e5c09ec1f6df8a0a4/extension/parquet/parquet_extension.cpp#L88

On Tue, Jan 14, 2025 at 3:11 PM Steve Loughran <ste...@cloudera.com.invalid> wrote:

> Is this the library used by DuckDB? As I've heard that it doesn't add
> statistics to parquet files, which is unfortunate
>
> On Tue, 14 Jan 2025 at 15:13, Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > I believe Ed added these statistics into parquet-rs[1] as well. We have
> > also enabled them by default and haven't seen any performance issues.
> >
> > Andrew
> >
> > [1] https://github.com/apache/arrow-rs/pull/6105
> >
> > On Tue, Jan 14, 2025 at 9:38 AM Gang Wu <ust...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > The C++ Parquet implementation in the Apache Arrow (namely the
> > > parquet-cpp) has added Page Index support since 13.0.0. Recently
> > > SizeStatistics support is also added in 19.0.0. Both features are
> > > disabled by default. We did a benchmark and the result showed that
> > > we can enable them by default with acceptable penalties. Therefore
> > > I opened a PR [1] to turn on them by default. The benchmark result
> > > is also available in this PR. Any feedback is welcome. If there is
> > > no objection, we will merge this PR and release it with Apache
> > > Arrow 20.0.0.
> > >
> > > [1] https://github.com/apache/arrow/pull/45249
> > >
> > > Best,
> > > Gang