Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

Sutou Kouhei Tue, 11 Jun 2024 00:49:06 -0700

Hi,

Thanks for sharing arrow-rs related information!


> 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP but
> I plan to propose upstreaming[4] to arrow-rs when complete)

Interesting. A separated but related discussion[10] will use
a record batch or a map array for statistics and it includes
multiple statistic items. But arrow-rs (DataFusion) uses an
Arrow array per statistic item.

It seems that datafunsion::common::Statistics[2] (this is a
higher level statistics object, right?) doesn't use Arrow
arrays. When extracted Parquet statistics are converted to
datafunsion::common::Statistics from Arrow arrays in
DataFusion? (It's WIP?)


[10] https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx


Thanks,
-- 
kou

In <CAFhtnRwgixMc_vFmogNZ7V=46mjg53md+dsftcw_c5qvyhs...@mail.gmail.com>
  "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Mon, 10 Jun 
2024 11:26:23 -0400,
  Andrew Lamb <al...@influxdata.com> wrote:

>> (Does arrow-rs compute statistics from in-memory Arrow array?)
> 
> Not really, though there are kernels[1] to do so for some types
> 
> We have two related concepts in the Rust ecosystem:
> 1. Full on statistics in DataFusion [2] (though no great way at the moment)
> 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP but
> I plan to propose upstreaming[4] to arrow-rs when complete)
> 
> I think that  code to extract statistics from Parquet files as arrow arrays
> is a very important feature (and lets query engines do row group and data
> page prunng).
> 
> The value of a  higher level Statistics object is a little less clear to me
> -- query engines end up with all sorts of complicated calculations on such
> objects (like predicate selectivity, NDV estimation, boundary analysis,
> etc) that finding what level makes sense in arrow might be hard.
> 
> Andrew
> 
> [1]: https://docs.rs/arrow/latest/arrow/compute/fn.min.html
> [2]:
> https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html
> [3]:
> https://github.com/apache/datafusion/blob/e094f94d2a3f23128ce782a20982dbf7ac1ebed2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L579
> [4]: https://github.com/apache/arrow-rs/issues/4328
> 
> On Sun, Jun 9, 2024 at 3:40 AM Sutou Kouhei <k...@clear-code.com> wrote:
> 
>> Hi,
>>
>> Thanks for your comment.
>>
>> You may misunderstand my motivation.
>>
>> This proposal doesn't change the Apache Arrow columnar
>> format. For example, this proposal doesn't save statistics
>> read from Apache Parquet file to Apache Arrow IPC file. This
>> proposal just attaches statistics read from Apache Parquet
>> file to in-memory arrow::Array C++ objects. It's just for
>> easy to use in-memory arrow::Array C++ objects.
>>
>> This proposal doesn't compute statistics from in-memory
>> arrow::Array C++ objects. (We may want to do it later but
>> this proposal doesn't propose it.)
>>
>> (Does arrow-rs compute statistics from in-memory Arrow
>> array?)
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com>
>>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 6
>> Jun 2024 08:13:11 +0200,
>>   Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>>
>> > Hi
>> >
>> > This is c++ specific, but imo the question applies more broadly.
>> >
>> > I understood that the rationale for stats in compressed+encoded formats
>> > like parquet is that computing those stats has a high cost (io +
>> decompress
>> > + decode + aggregate). This motivates the materialization of aggregates.
>> >
>> > In arrow the data is already in an in-memory format (e.g. IPC+mmap, or in
>> > the heap) and the cost is thus smaller (aggregate).
>> >
>> > It could be useful to quantify how much is being saved vs how much
>> > complexity is being added to the format + implementations.
>> >
>> > Best,
>> > Jorge
>> >
>> >
>> > On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >
>> >> Generally I think this is a good idea that has been proposed before but
>> I
>> >> don't think we could ever make progress on design.
>> >>
>> >> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > Related GitHub issue:
>> >> > https://github.com/apache/arrow/issues/41909
>> >> >
>> >> > How about adding arrow::ArrayStatistics?
>> >> >
>> >> > Motivation:
>> >> >
>> >> > An Apache Arrow format data doesn't have statistics. (We can
>> >> > add statistics as metadata but there isn't any standard way
>> >> > for it.)
>> >> >
>> >> > But a source of an Apache Arrow format data such as Apache
>> >> > Parquet format data may have statistics. We can get the
>> >> > source statistics via source reader such as
>> >> > parquet::ColumnChunkMetaData::statistics() but can't get
>> >> > them from read Apache Arrow format data. If we want to use
>> >> > the source statistics, we need to keep the source reader.
>> >> >
>> >> > Proposal:
>> >> >
>> >> > How about adding arrow::ArrayStatistics or something and
>> >> > attaching source statistics to read arrow::Array? If source
>> >> > statistics are attached to read arrow::Array, we don't need
>> >> > to keep a source reader to get source statistics.
>> >> >
>> >> > What do you think about this idea?
>> >> >
>> >> >
>> >> > NOTE: I haven't thought about the arrow::ArrayStatistics
>> >> > details yet. We'll be able to use parquet::Statistics and
>> >> > its family as a reference.
>> >> >
>> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>> >> >
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > kou
>> >> >
>> >>
>>

Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

Reply via email to