> It could be useful to quantify how much is being saved vs how much
> complexity is being added to the format + implementations.

Xiangpeng and I are working on a blog post to quantify this overhead in
parquet-rs -- I'll post it here when ready

Andrew

On Thu, Jun 6, 2024 at 2:13 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi
>
> This is C++-specific, but IMO the question applies more broadly.
>
> I understood that the rationale for stats in compressed+encoded formats
> like Parquet is that computing those stats has a high cost (I/O + decompress
> + decode + aggregate). This motivates the materialization of aggregates.
>
> In Arrow the data is already in an in-memory format (e.g. IPC+mmap, or in
> the heap) and the cost is thus smaller (aggregate only).
>
> It could be useful to quantify how much is being saved vs how much
> complexity is being added to the format + implementations.
>
> Best,
> Jorge
>
>
> On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Generally, I think this is a good idea. It has been proposed before, but
> > I don't think we were ever able to make progress on a design.
> >
> > On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >
> > > Hi,
> > >
> > > Related GitHub issue:
> > > https://github.com/apache/arrow/issues/41909
> > >
> > > How about adding arrow::ArrayStatistics?
> > >
> > > Motivation:
> > >
> > > Apache Arrow format data doesn't have statistics. (We can
> > > add statistics as metadata, but there isn't any standard way
> > > to do it.)
> > >
> > > But the source of Apache Arrow format data, such as Apache
> > > Parquet format data, may have statistics. We can get the
> > > source statistics via the source reader, for example
> > > parquet::ColumnChunkMetaData::statistics(), but we can't get
> > > them from the Apache Arrow format data that was read. If we
> > > want to use the source statistics, we need to keep the source
> > > reader.
> > >
> > > Proposal:
> > >
> > > How about adding arrow::ArrayStatistics or something similar
> > > and attaching source statistics to the arrow::Array that is
> > > read? If source statistics are attached to the read
> > > arrow::Array, we don't need to keep a source reader to get
> > > source statistics.
> > >
> > > What do you think about this idea?
> > >
> > >
> > > NOTE: I haven't thought about the arrow::ArrayStatistics
> > > details yet. We'll be able to use parquet::Statistics and
> > > its family as a reference.
> > > https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> >
>
