> (Does arrow-rs compute statistics from in-memory Arrow array?)

Not really, though there are kernels[1] to do so for some types

We have two related concepts in the Rust ecosystem:
1. Full on statistics in DataFusion [2] (though no great way at the moment)
2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP but
I plan to propose upstreaming[4] to arrow-rs when complete)

I think that  code to extract statistics from Parquet files as arrow arrays
is a very important feature (and lets query engines do row group and data
page prunng).

The value of a  higher level Statistics object is a little less clear to me
-- query engines end up with all sorts of complicated calculations on such
objects (like predicate selectivity, NDV estimation, boundary analysis,
etc) that finding what level makes sense in arrow might be hard.

Andrew

[1]: https://docs.rs/arrow/latest/arrow/compute/fn.min.html
[2]:
https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html
[3]:
https://github.com/apache/datafusion/blob/e094f94d2a3f23128ce782a20982dbf7ac1ebed2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L579
[4]: https://github.com/apache/arrow-rs/issues/4328

On Sun, Jun 9, 2024 at 3:40 AM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> Thanks for your comment.
>
> You may misunderstand my motivation.
>
> This proposal doesn't change the Apache Arrow columnar
> format. For example, this proposal doesn't save statistics
> read from Apache Parquet file to Apache Arrow IPC file. This
> proposal just attaches statistics read from Apache Parquet
> file to in-memory arrow::Array C++ objects. It's just for
> easy to use in-memory arrow::Array C++ objects.
>
> This proposal doesn't compute statistics from in-memory
> arrow::Array C++ objects. (We may want to do it later but
> this proposal doesn't propose it.)
>
> (Does arrow-rs compute statistics from in-memory Arrow
> array?)
>
>
> Thanks,
> --
> kou
>
> In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com>
>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 6
> Jun 2024 08:13:11 +0200,
>   Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>
> > Hi
> >
> > This is c++ specific, but imo the question applies more broadly.
> >
> > I understood that the rationale for stats in compressed+encoded formats
> > like parquet is that computing those stats has a high cost (io +
> decompress
> > + decode + aggregate). This motivates the materialization of aggregates.
> >
> > In arrow the data is already in an in-memory format (e.g. IPC+mmap, or in
> > the heap) and the cost is thus smaller (aggregate).
> >
> > It could be useful to quantify how much is being saved vs how much
> > complexity is being added to the format + implementations.
> >
> > Best,
> > Jorge
> >
> >
> > On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> >> Generally I think this is a good idea that has been proposed before but
> I
> >> don't think we could ever make progress on design.
> >>
> >> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > Related GitHub issue:
> >> > https://github.com/apache/arrow/issues/41909
> >> >
> >> > How about adding arrow::ArrayStatistics?
> >> >
> >> > Motivation:
> >> >
> >> > An Apache Arrow format data doesn't have statistics. (We can
> >> > add statistics as metadata but there isn't any standard way
> >> > for it.)
> >> >
> >> > But a source of an Apache Arrow format data such as Apache
> >> > Parquet format data may have statistics. We can get the
> >> > source statistics via source reader such as
> >> > parquet::ColumnChunkMetaData::statistics() but can't get
> >> > them from read Apache Arrow format data. If we want to use
> >> > the source statistics, we need to keep the source reader.
> >> >
> >> > Proposal:
> >> >
> >> > How about adding arrow::ArrayStatistics or something and
> >> > attaching source statistics to read arrow::Array? If source
> >> > statistics are attached to read arrow::Array, we don't need
> >> > to keep a source reader to get source statistics.
> >> >
> >> > What do you think about this idea?
> >> >
> >> >
> >> > NOTE: I haven't thought about the arrow::ArrayStatistics
> >> > details yet. We'll be able to use parquet::Statistics and
> >> > its family as a reference.
> >> >
> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
> >> >
> >> >
> >> > Thanks,
> >> > --
> >> > kou
> >> >
> >>
>

Reply via email to