Hi Kou,

I agree with Dewey that this is overstretching the capabilities of the C Data Interface. In particular, stuffing a pointer into a metadata value and decreeing it immortal doesn't sound like a good design decision.

Why not simply pass the statistics ArrowArray separately in your producer API of choice (Dewey mentioned ADBC but it is of course just a possible API among others)?

Regards

Antoine.


On 22/05/2024 at 04:37, Sutou Kouhei wrote:
Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
provider, because there isn't a standard way to provide
statistics through the C data interface. If a consumer
knows the statistics, it can process Apache Arrow data
faster based on them.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.

So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as an ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
   .format = "+s",
   .metadata = {
     /* table-level statistics such as row count */
     "ARROW:statistics" => ArrowArray*,
   },
   .children = {
     ArrowSchema {
       .name = "column1",
       .format = "i",
       .metadata = {
         /* column-level statistics such as count distinct */
         "ARROW:statistics" => ArrowArray*,
       },
     },
     ArrowSchema {
       .name = "column2",
       .format = "u",
       .metadata = {
         /* column-level statistics such as count distinct */
         "ARROW:statistics" => ArrowArray*,
       },
     },
   },
}

The metadata value (ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base 10 string of the address of the
ArrowArray, because we can use only strings for metadata
values. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns the statistics
ArrowArray*.)
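
To make the address encoding concrete, here is a minimal sketch in C of how a producer might write the address as a base 10 string and how a consumer might parse it back. The function names encode_statistics_address and decode_statistics_address are hypothetical and are not part of any Arrow API. The sketch assumes the consumer receives the metadata value as a length-prefixed byte string without a terminating NUL, which is how the C data interface stores metadata values.

```c
#include <errno.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct ArrowArray; /* opaque here; only the address matters */

/* Hypothetical helper: write the address of `array` as a base 10 string.
 * Returns the number of digits written, or -1 if `out` is too small. */
int encode_statistics_address(const struct ArrowArray* array, char* out,
                              size_t out_len) {
  int n = snprintf(out, out_len, "%" PRIuPTR, (uintptr_t)array);
  return (n > 0 && (size_t)n < out_len) ? n : -1;
}

/* Hypothetical helper: parse a base 10 address taken from a metadata value.
 * `value` is NOT assumed to be NUL-terminated.
 * Returns NULL if the value is not a valid decimal address. */
struct ArrowArray* decode_statistics_address(const char* value,
                                             size_t value_len) {
  char buf[32];
  if (value_len == 0 || value_len >= sizeof(buf)) return NULL;
  /* Accept decimal digits only (no sign, no whitespace). */
  for (size_t i = 0; i < value_len; i++) {
    if (value[i] < '0' || value[i] > '9') return NULL;
  }
  memcpy(buf, value, value_len);
  buf[value_len] = '\0';
  errno = 0;
  uintmax_t addr = strtoumax(buf, NULL, 10);
  if (errno != 0 || addr > UINTPTR_MAX) return NULL;
  return (struct ArrowArray*)(uintptr_t)addr;
}
```

Note that the decoded pointer is only meaningful inside the process that produced it, and only while the base ArrowSchema is alive.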


ArrowArray* for statistics uses the following schema:

| Field Name     | Field Type                       | Comments |
|----------------|----------------------------------| -------- |
| key            | string not null                  | (1)      |
| value          | `VALUE_SCHEMA` not null          |          |
| is_approximate | bool not null                    | (2)      |

1. We'll provide pre-defined keys such as "max", "min",
    "byte_width" and "distinct_count" but users can also use
    application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                       | Comments |
|------------|----------------------------------| -------- |
| int64      | int64                            |          |
| uint64     | uint64                           |          |
| float64    | float64                          |          |
| value      | The same type as the ArrowSchema | (3)      |
|            | it belongs to.                   |          |

3. If the ArrowSchema's type is string, this type is also string.

    TODO: Is "value" a good name? If we refer to it from the
    top-level statistics schema, we need to use
    "value.value". It's a bit strange...
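
As an illustration of the two tables above, the following sketch spells out the statistics schema for a string column as a statically allocated ArrowSchema tree. The struct definition and the format strings ("+s", "u", "b", "l", "L", "g", "+ud:...") come from the C data interface specification. Everything else is an assumption made for the sketch: the union type ids 0 to 3, the nullable flag on the union members, the empty name of the top-level struct, and the function name statistics_schema_for_string.

```c
#include <stddef.h>
#include <stdint.h>

#define ARROW_FLAG_NULLABLE 2

/* ArrowSchema as defined by the Arrow C data interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

/* Everything below has static storage, so there is nothing to free;
 * the callback only marks the struct as released. */
static void release_static(struct ArrowSchema* schema) {
  schema->release = NULL;
}

#define FIELD(fmt, nm, fl, n, kids) \
  {fmt, nm, NULL, fl, n, kids, NULL, release_static, NULL}

/* VALUE_SCHEMA members; nullability is an assumption of this sketch. */
static struct ArrowSchema m_int64 = FIELD("l", "int64", ARROW_FLAG_NULLABLE, 0, NULL);
static struct ArrowSchema m_uint64 = FIELD("L", "uint64", ARROW_FLAG_NULLABLE, 0, NULL);
static struct ArrowSchema m_float64 = FIELD("g", "float64", ARROW_FLAG_NULLABLE, 0, NULL);
/* Same type as the described column: string ("u") in this example. */
static struct ArrowSchema m_value = FIELD("u", "value", ARROW_FLAG_NULLABLE, 0, NULL);
static struct ArrowSchema* union_members[] = {&m_int64, &m_uint64, &m_float64, &m_value};

/* Top-level fields: all three are "not null", hence flags = 0. */
static struct ArrowSchema f_key = FIELD("u", "key", 0, 0, NULL);
/* Dense union; the type ids 0,1,2,3 are an assumption of this sketch. */
static struct ArrowSchema f_value = FIELD("+ud:0,1,2,3", "value", 0, 4, union_members);
static struct ArrowSchema f_is_approximate = FIELD("b", "is_approximate", 0, 0, NULL);
static struct ArrowSchema* stat_fields[] = {&f_key, &f_value, &f_is_approximate};

static struct ArrowSchema statistics_schema = FIELD("+s", "", 0, 3, stat_fields);

/* Hypothetical accessor for the statistics schema of a string column. */
const struct ArrowSchema* statistics_schema_for_string(void) {
  return &statistics_schema;
}
```

In this layout the column-typed union member is reached as children[1]->children[3], which corresponds to the "value.value" path mentioned in the TODO.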


What do you think about this proposal? Could you share your
comments?


Thanks,
