Hi, Thanks for sharing arrow-rs related information!
> 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP but > I plan to propose upstreaming[4] to arrow-rs when complete) Interesting. A separated but related discussion[10] will use a record batch or a map array for statistics and it includes multiple statistic items. But arrow-rs (DataFusion) uses an Arrow array per statistic item. It seems that datafunsion::common::Statistics[2] (this is a higher level statistics object, right?) doesn't use Arrow arrays. When extracted Parquet statistics are converted to datafunsion::common::Statistics from Arrow arrays in DataFusion? (It's WIP?) [10] https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx Thanks, -- kou In <CAFhtnRwgixMc_vFmogNZ7V=46mjg53md+dsftcw_c5qvyhs...@mail.gmail.com> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Mon, 10 Jun 2024 11:26:23 -0400, Andrew Lamb <al...@influxdata.com> wrote: >> (Does arrow-rs compute statistics from in-memory Arrow array?) > > Not really, though there are kernels[1] to do so for some types > > We have two related concepts in the Rust ecosystem: > 1. Full on statistics in DataFusion [2] (though no great way at the moment) > 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP but > I plan to propose upstreaming[4] to arrow-rs when complete) > > I think that code to extract statistics from Parquet files as arrow arrays > is a very important feature (and lets query engines do row group and data > page prunng). > > The value of a higher level Statistics object is a little less clear to me > -- query engines end up with all sorts of complicated calculations on such > objects (like predicate selectivity, NDV estimation, boundary analysis, > etc) that finding what level makes sense in arrow might be hard. > > Andrew > > [1]: https://docs.rs/arrow/latest/arrow/compute/fn.min.html > [2]: > https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html > [3]: > https://github.com/apache/datafusion/blob/e094f94d2a3f23128ce782a20982dbf7ac1ebed2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L579 > [4]: https://github.com/apache/arrow-rs/issues/4328 > > On Sun, Jun 9, 2024 at 3:40 AM Sutou Kouhei <k...@clear-code.com> wrote: > >> Hi, >> >> Thanks for your comment. >> >> You may misunderstand my motivation. >> >> This proposal doesn't change the Apache Arrow columnar >> format. For example, this proposal doesn't save statistics >> read from Apache Parquet file to Apache Arrow IPC file. This >> proposal just attaches statistics read from Apache Parquet >> file to in-memory arrow::Array C++ objects. It's just for >> easy to use in-memory arrow::Array C++ objects. >> >> This proposal doesn't compute statistics from in-memory >> arrow::Array C++ objects. (We may want to do it later but >> this proposal doesn't propose it.) >> >> (Does arrow-rs compute statistics from in-memory Arrow >> array?) >> >> >> Thanks, >> -- >> kou >> >> In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com> >> "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 6 >> Jun 2024 08:13:11 +0200, >> Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote: >> >> > Hi >> > >> > This is c++ specific, but imo the question applies more broadly. >> > >> > I understood that the rationale for stats in compressed+encoded formats >> > like parquet is that computing those stats has a high cost (io + >> decompress >> > + decode + aggregate). This motivates the materialization of aggregates. >> > >> > In arrow the data is already in an in-memory format (e.g. IPC+mmap, or in >> > the heap) and the cost is thus smaller (aggregate). >> > >> > It could be useful to quantify how much is being saved vs how much >> > complexity is being added to the format + implementations. >> > >> > Best, >> > Jorge >> > >> > >> > On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> > >> >> Generally I think this is a good idea that has been proposed before but >> I >> >> don't think we could ever make progress on design. >> >> >> >> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote: >> >> >> >> > Hi, >> >> > >> >> > Related GitHub issue: >> >> > https://github.com/apache/arrow/issues/41909 >> >> > >> >> > How about adding arrow::ArrayStatistics? >> >> > >> >> > Motivation: >> >> > >> >> > An Apache Arrow format data doesn't have statistics. (We can >> >> > add statistics as metadata but there isn't any standard way >> >> > for it.) >> >> > >> >> > But a source of an Apache Arrow format data such as Apache >> >> > Parquet format data may have statistics. We can get the >> >> > source statistics via source reader such as >> >> > parquet::ColumnChunkMetaData::statistics() but can't get >> >> > them from read Apache Arrow format data. If we want to use >> >> > the source statistics, we need to keep the source reader. >> >> > >> >> > Proposal: >> >> > >> >> > How about adding arrow::ArrayStatistics or something and >> >> > attaching source statistics to read arrow::Array? If source >> >> > statistics are attached to read arrow::Array, we don't need >> >> > to keep a source reader to get source statistics. >> >> > >> >> > What do you think about this idea? >> >> > >> >> > >> >> > NOTE: I haven't thought about the arrow::ArrayStatistics >> >> > details yet. We'll be able to use parquet::Statistics and >> >> > its family as a reference. >> >> > >> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h >> >> > >> >> > >> >> > Thanks, >> >> > -- >> >> > kou >> >> > >> >> >>