Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

Sutou Kouhei Sun, 09 Jun 2024 00:40:28 -0700

Hi,

Thanks for your comment.


You may have misunderstood my motivation.

This proposal doesn't change the Apache Arrow columnar
format. For example, this proposal doesn't save statistics
read from an Apache Parquet file to an Apache Arrow IPC
file. This proposal just attaches statistics read from an
Apache Parquet file to in-memory arrow::Array C++ objects.
The only goal is to make in-memory arrow::Array C++ objects
easier to use.

This proposal doesn't compute statistics from in-memory
arrow::Array C++ objects. (We may want to do it later but
this proposal doesn't propose it.)
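
As a rough illustration of the two points above, here is a
minimal standalone sketch. It is not a design. Every name in
it (ArrayStatistics, SourceStatistics, Int64Array,
ReadColumn) is a hypothetical stand-in, it doesn't use the
real Arrow or Parquet APIs, and the arrow::ArrayStatistics
details are still undecided. It only shows statistics being
copied from what the source reports, never computed from the
values:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed arrow::ArrayStatistics.
// Every field is optional because a source may not provide it.
struct ArrayStatistics {
  std::optional<int64_t> null_count;
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Hypothetical stand-in for what a source reader reports,
// for example per-column-chunk statistics in a Parquet file.
struct SourceStatistics {
  int64_t null_count;
  bool has_min_max;
  int64_t min;
  int64_t max;
};

// Simplified stand-in for an in-memory array. The statistics
// live only on the C++ object. They are not part of the
// columnar format and are not written to IPC.
struct Int64Array {
  std::vector<int64_t> values;
  std::shared_ptr<const ArrayStatistics> statistics;  // null if the source had none
};

// Builds the in-memory array and attaches the source statistics
// as-is. Nothing is aggregated from `decoded` here.
Int64Array ReadColumn(std::vector<int64_t> decoded,
                      const SourceStatistics* source) {
  assert(source == nullptr || !source->has_min_max ||
         source->min <= source->max);
  Int64Array array;
  array.values = std::move(decoded);
  if (source != nullptr) {
    auto stats = std::make_shared<ArrayStatistics>();
    stats->null_count = source->null_count;
    if (source->has_min_max) {
      stats->min = source->min;
      stats->max = source->max;
    }
    array.statistics = std::move(stats);
  }
  return array;
}
```

With something like this, a consumer could check
array.statistics after the source reader is gone. A null
pointer would mean "the source provided no statistics", not
"please compute them".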

(Does arrow-rs compute statistics from in-memory Arrow
arrays?)


Thanks,
-- 
kou

In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com>
  "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 6 Jun 2024 08:13:11 +0200,
  Jorge Cardoso Leitão <[email protected]> wrote:

> Hi
> 
> This is c++ specific, but imo the question applies more broadly.
> 
> I understood that the rationale for stats in compressed+encoded formats
> like parquet is that computing those stats has a high cost (io + decompress
> + decode + aggregate). This motivates the materialization of aggregates.
> 
> In arrow the data is already in an in-memory format (e.g. IPC+mmap, or in
> the heap) and the cost is thus smaller (aggregate).
> 
> It could be useful to quantify how much is being saved vs how much
> complexity is being added to the format + implementations.
> 
> Best,
> Jorge
> 
> 
> On Thu, Jun 6, 2024, 07:55 Micah Kornfield <[email protected]> wrote:
> 
>> Generally I think this is a good idea that has been proposed before but I
>> don't think we could ever make progress on design.
>>
>> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Related GitHub issue:
>> > https://github.com/apache/arrow/issues/41909
>> >
>> > How about adding arrow::ArrayStatistics?
>> >
>> > Motivation:
>> >
>> > Apache Arrow format data doesn't have statistics. (We can
>> > add statistics as metadata but there isn't any standard
>> > way to do it.)
>> >
>> > But the source of Apache Arrow format data, such as Apache
>> > Parquet format data, may have statistics. We can get the
>> > source statistics via a source reader API such as
>> > parquet::ColumnChunkMetaData::statistics(), but we can't
>> > get them from the Apache Arrow format data that was read.
>> > If we want to use the source statistics, we need to keep
>> > the source reader.
>> >
>> > Proposal:
>> >
>> > How about adding arrow::ArrayStatistics or something
>> > similar and attaching the source statistics to the
>> > arrow::Array that was read? If the source statistics are
>> > attached to that arrow::Array, we don't need to keep a
>> > source reader to get them.
>> >
>> > What do you think about this idea?
>> >
>> >
>> > NOTE: I haven't thought about the arrow::ArrayStatistics
>> > details yet. We'll be able to use parquet::Statistics and
>> > its family as a reference.
>> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
>> >
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>>
