> It could be useful to quantify how much is being saved vs how much complexity is being added to the format + implementations.

Xiangpeng and I are working on a blog post to quantify this overhead in
parquet-rs -- I'll post it here when ready

Andrew

On Thu, Jun 6, 2024 at 2:13 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> Hi
>
> This is c++ specific, but imo the question applies more broadly.
>
> I understood that the rationale for stats in compressed+encoded formats
> like parquet is that computing those stats has a high cost (io + decompress
> + decode + aggregate). This motivates the materialization of aggregates.
>
> In arrow the data is already in an in-memory format (e.g. IPC+mmap, or in
> the heap) and the cost is thus smaller (aggregate).
>
> It could be useful to quantify how much is being saved vs how much
> complexity is being added to the format + implementations.
>
> Best,
> Jorge
>
>
> On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Generally I think this is a good idea that has been proposed before but I
> > don't think we could ever make progress on design.
> >
> > On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >
> > > Hi,
> > >
> > > Related GitHub issue:
> > > https://github.com/apache/arrow/issues/41909
> > >
> > > How about adding arrow::ArrayStatistics?
> > >
> > > Motivation:
> > >
> > > An Apache Arrow format data doesn't have statistics. (We can
> > > add statistics as metadata but there isn't any standard way
> > > for it.)
> > >
> > > But a source of an Apache Arrow format data such as Apache
> > > Parquet format data may have statistics. We can get the
> > > source statistics via source reader such as
> > > parquet::ColumnChunkMetaData::statistics() but can't get
> > > them from read Apache Arrow format data. If we want to use
> > > the source statistics, we need to keep the source reader.
> > >
> > > Proposal:
> > >
> > > How about adding arrow::ArrayStatistics or something and
> > > attaching source statistics to read arrow::Array? If source
> > > statistics are attached to read arrow::Array, we don't need
> > > to keep a source reader to get source statistics.
> > >
> > > What do you think about this idea?
> > >
> > >
> > > NOTE: I haven't thought about the arrow::ArrayStatistics
> > > details yet. We'll be able to use parquet::Statistics and
> > > its family as a reference.
> > > https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
> > >
> > >
> > > Thanks,
> > > --
> > > kou
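
To make the trade-off in this thread concrete, here is a rough sketch of what "attaching source statistics to an array" might look like next to recomputing them from in-memory data. Everything in it is an assumption made for illustration: `ArrayStatistics`, `Int64Column`, `ComputeStatistics` and `GetStatistics` are hypothetical names, not Arrow C++ API, and the proposal itself says the details have not been designed yet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical statistics holder; NOT an existing Arrow C++ type.
// Every field is optional because a source (e.g. a Parquet column chunk)
// may provide only some of them.
struct ArrayStatistics {
  std::optional<int64_t> null_count;
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Hypothetical stand-in for an in-memory int64 array that can carry
// statistics copied from its source when it was read.
struct Int64Column {
  std::vector<int64_t> values;
  std::vector<bool> is_valid;                 // false marks a null slot
  std::optional<ArrayStatistics> statistics;  // attached source statistics, if any
};

// Recompute statistics from the in-memory values: a single aggregation pass,
// with no I/O, decompression, or decoding involved.
ArrayStatistics ComputeStatistics(const Int64Column& column) {
  assert(column.values.size() == column.is_valid.size());
  ArrayStatistics stats;
  int64_t nulls = 0;
  for (std::size_t i = 0; i < column.values.size(); ++i) {
    if (!column.is_valid[i]) {
      ++nulls;
      continue;
    }
    const int64_t v = column.values[i];
    if (!stats.min || v < *stats.min) stats.min = v;
    if (!stats.max || v > *stats.max) stats.max = v;
  }
  stats.null_count = nulls;
  return stats;
}

// Prefer statistics attached at read time; fall back to recomputing them.
ArrayStatistics GetStatistics(const Int64Column& column) {
  return column.statistics ? *column.statistics : ComputeStatistics(column);
}
```

With attached statistics the consumer no longer needs the source reader; without them the fallback is the in-memory aggregate whose cost the blog post mentioned above would be expected to quantify.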