Re: [DISCUSS] Statistics through the C data interface

Raphael Taylor-Davies Wed, 22 May 2024 04:15:20 -0700

Hi,

One potential challenge with encoding statistics in the schema metadatais that some systems may consider this metadata as part of assessingschema equivalence.

However, I think the bigger question is what the intended use-case forthese statistics is? Often query engines want to collect statistics frommultiple containers in one go, as this allows for efficient vectorisedpruning across multiple files, row groups, etc... I therefore wonder ifthe solution is simply to return separate arrays of min, max, etc...potentially even grouped together into a single StructArray?

This would have the benefit of not needing specification changes, whilstbeing significantly more efficient than an approach centered on scalarstatistics. FWIW this is the approach taken by DataFusion for pruningstatistics [1], and in arrow-rs we represent scalars as arrays to avoidneeding to define a parallel serialization standard [2].


Kind Regards,

Raphael

[1]:https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html

[2]: https://github.com/apache/arrow-rs/pull/4393

On 22/05/2024 03:37, Sutou Kouhei wrote:

Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
producer. Because there isn't a standard way to provide
statistics through the C data interface. If a consumer can
know statistics, it can process Apache Arrow data faster
based on statistics.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as a metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.

So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as a ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
   .format = "+siu",
   .metadata = {
     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row 
count */
   },
   .children = {
     ArrowSchema {
       .name = "column1",
       .format = "i",
       .metadata = {
         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
       },
     },
     ArrowSchema {
       .name = "column2",
       .format = "u",
       .metadata = {
         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as 
count distinct */
       },
     },
   },
}

The metadata value (ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base 10 string of the address of the
ArrowArray. Because we can use only string for metadata
value. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns statistics
ArrowArray*.)


ArrowArray* for statistics use the following schema:

| Field Name     | Field Type                       | Comments |
|----------------|----------------------------------| -------- |
| key            | string not null                  | (1)      |
| value          | `VALUE_SCHEMA` not null          |          |
| is_approximate | bool not null                    | (2)      |

1. We'll provide pre-defined keys such as "max", "min",
    "byte_width" and "distinct_count" but users can also use
    application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                       | Comments |
|------------|----------------------------------| -------- |
| int64      | int64                            |          |
| uint64     | uint64                           |          |
| float64    | float64                          |          |
| value      | The same type of the ArrowSchema | (3)      |
|            | that is belonged to.             |          |

3. If the ArrowSchema's type is string, this type is also string.

    TODO: Is "value" good name? If we refer it from the
    top-level statistics schema, we need to use
    "value.value". It's a bit strange...


What do you think about this proposal? Could you share your
comments?


Thanks,

Re: [DISCUSS] Statistics through the C data interface

Reply via email to