Hi,

Thanks for your comment.

You may have misunderstood my motivation.

This proposal doesn't change the Apache Arrow columnar
format. For example, it doesn't save statistics read from an
Apache Parquet file into an Apache Arrow IPC file.

This proposal only attaches statistics read from an Apache
Parquet file to in-memory arrow::Array C++ objects. The goal
is just to make in-memory arrow::Array C++ objects easier to
use.

This proposal doesn't compute statistics from in-memory
arrow::Array C++ objects. (We may want to do that later, but
it is out of scope for this proposal.)

(Does arrow-rs compute statistics from in-memory Arrow
arrays?)


Thanks,
--
kou

In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com>
  "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 6 Jun 2024 08:13:11 +0200,
  Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> Hi
>
> This is c++ specific, but imo the question applies more broadly.
>
> I understood that the rationale for stats in compressed+encoded formats
> like parquet is that computing those stats has a high cost (io + decompress
> + decode + aggregate). This motivates the materialization of aggregates.
>
> In arrow the data is already in an in-memory format (e.g. IPC+mmap, or in
> the heap) and the cost is thus smaller (aggregate).
>
> It could be useful to quantify how much is being saved vs how much
> complexity is being added to the format + implementations.
>
> Best,
> Jorge
>
>
> On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Generally I think this is a good idea that has been proposed before but I
>> don't think we could ever make progress on design.
>>
>> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
>>
>> > Hi,
>> >
>> > Related GitHub issue:
>> > https://github.com/apache/arrow/issues/41909
>> >
>> > How about adding arrow::ArrayStatistics?
>> >
>> > Motivation:
>> >
>> > An Apache Arrow format data doesn't have statistics. (We can
>> > add statistics as metadata but there isn't any standard way
>> > for it.)
>> >
>> > But a source of an Apache Arrow format data such as Apache
>> > Parquet format data may have statistics. We can get the
>> > source statistics via source reader such as
>> > parquet::ColumnChunkMetaData::statistics() but can't get
>> > them from read Apache Arrow format data. If we want to use
>> > the source statistics, we need to keep the source reader.
>> >
>> > Proposal:
>> >
>> > How about adding arrow::ArrayStatistics or something and
>> > attaching source statistics to read arrow::Array? If source
>> > statistics are attached to read arrow::Array, we don't need
>> > to keep a source reader to get source statistics.
>> >
>> > What do you think about this idea?
>> >
>> >
>> > NOTE: I haven't thought about the arrow::ArrayStatistics
>> > details yet. We'll be able to use parquet::Statistics and
>> > its family as a reference.
>> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>> >
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>>