Re: [DISCUSS] Statistics through the C data interface

Antoine Pitrou Wed, 22 May 2024 08:05:42 -0700


Hi Kou,

I agree with Dewey that this is overstretching the capabilities of the C Data Interface. In particular, stuffing a pointer into a metadata value and decreeing it immortal doesn't sound like a good design decision.

Why not simply pass the statistics ArrowArray separately in your producer API of choice (Dewey mentioned ADBC, but it is of course just one possible API among others)?


Regards

Antoine.


On 22/05/2024 at 04:37, Sutou Kouhei wrote:

Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
provider, because there isn't a standard way to provide
statistics through the C data interface. If a consumer
knows the statistics, it can process Apache Arrow data
faster based on them.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as metadata in ArrowSchema?

We reserve the "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.


So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as an ArrowArray, like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
   .format = "+siu",
   .metadata = {
     /* table-level statistics such as row count */
     "ARROW:statistics" => ArrowArray*,
   },
   .children = {
     ArrowSchema {
       .name = "column1",
       .format = "i",
       .metadata = {
         /* column-level statistics such as count distinct */
         "ARROW:statistics" => ArrowArray*,
       },
     },
     ArrowSchema {
       .name = "column2",
       .format = "u",
       .metadata = {
         /* column-level statistics such as count distinct */
         "ARROW:statistics" => ArrowArray*,
       },
     },
   },
}
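
Note that the `.metadata = { ... }` notation above is shorthand. In the C data interface, the `metadata` member is a single binary buffer: an int32 pair count, followed for each pair by an int32 key length, the key bytes, an int32 value length and the value bytes, all in native endianness and without NUL terminators. As a minimal sketch of how a consumer might look up the proposed "ARROW:statistics" key (the helper name `metadata_find` is hypothetical, and the sketch trusts the producer to supply a well-formed buffer):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: look up `key` in a C data interface metadata
 * buffer. Returns a pointer to the value bytes (NOT NUL-terminated) and
 * stores their length in `*value_length`, or returns NULL if the key is
 * absent or there is no metadata at all. */
static const char* metadata_find(const char* metadata, const char* key,
                                 int32_t* value_length) {
  if (metadata == NULL) return NULL;

  const size_t key_length = strlen(key);
  const char* pos = metadata;

  /* memcpy avoids unaligned reads; lengths are in native endianness. */
  int32_t n_pairs;
  memcpy(&n_pairs, pos, sizeof(n_pairs));
  pos += sizeof(n_pairs);

  for (int32_t i = 0; i < n_pairs; ++i) {
    int32_t klen, vlen;

    memcpy(&klen, pos, sizeof(klen));
    pos += sizeof(klen);
    const char* k = pos;
    pos += klen;

    memcpy(&vlen, pos, sizeof(vlen));
    pos += sizeof(vlen);
    const char* v = pos;
    pos += vlen;

    if ((size_t)klen == key_length && memcmp(k, key, key_length) == 0) {
      *value_length = vlen;
      return v;
    }
  }
  return NULL;
}
```

A consumer would call this on each ArrowSchema's `metadata` member (the top-level one for table-level statistics, each child's for column-level statistics).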

The metadata value (the ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is the address of the ArrowArray written as a
base-10 string, because a metadata value can only be a
string. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns the statistics
ArrowArray*.)
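
To illustrate the encoding described above, here is a minimal sketch of the round trip between an ArrowArray address and its base-10 string form. The function names are hypothetical and not part of any Arrow library. The consumer side takes an explicit length because metadata values are not NUL-terminated.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Forward declaration only; the real definition comes from the
 * Arrow C data interface header. */
struct ArrowArray;

/* Producer side (hypothetical helper): write the address of `stats` into
 * `buf` as a base-10 string. Returns the number of characters written,
 * excluding the terminating NUL, or -1 if `buf` is too small. */
static int stats_pointer_to_string(const struct ArrowArray* stats, char* buf,
                                   size_t buf_size) {
  int n = snprintf(buf, buf_size, "%" PRIuPTR, (uintptr_t)stats);
  return (n < 0 || (size_t)n >= buf_size) ? -1 : n;
}

/* Consumer side (hypothetical helper): parse a base-10 address of the
 * given length. Returns NULL if the value is empty, too long, contains
 * non-digit characters, or does not fit in a pointer. */
static struct ArrowArray* stats_pointer_from_string(const char* value,
                                                    size_t length) {
  char tmp[32];
  if (length == 0 || length >= sizeof(tmp)) return NULL;
  memcpy(tmp, value, length);
  tmp[length] = '\0';

  /* strtoumax would accept leading whitespace or a sign; reject those. */
  if (tmp[0] < '0' || tmp[0] > '9') return NULL;

  char* end = NULL;
  errno = 0;
  uintmax_t address = strtoumax(tmp, &end, 10);
  if (errno != 0 || end != tmp + length || address > UINTPTR_MAX) return NULL;

  return (struct ArrowArray*)(uintptr_t)address;
}
```

The parsed pointer is only meaningful inside the process that produced it and only while the owning ArrowSchema is alive, which is the lifetime constraint the proposal relies on.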


ArrowArray* for statistics uses the following schema:

| Field Name     | Field Type                       | Comments |
|----------------|----------------------------------| -------- |
| key            | string not null                  | (1)      |
| value          | `VALUE_SCHEMA` not null          |          |
| is_approximate | bool not null                    | (2)      |

1. We'll provide pre-defined keys such as "max", "min",
    "byte_width" and "distinct_count", but users can also use
    application-specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                       | Comments |
|------------|----------------------------------| -------- |
| int64      | int64                            |          |
| uint64     | uint64                           |          |
| float64    | float64                          |          |
| value      | The same type as the ArrowSchema | (3)      |
|            | it belongs to.                   |          |

3. If the ArrowSchema's type is string, this type is also string.

    TODO: Is "value" a good name? If we refer to it from the
    top-level statistics schema, we need to use
    "value.value". It's a bit strange...

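As a rough illustration, the statistics schema above could be spelled with C data interface format strings as follows. This sketch assumes the column is int32 (so the fourth union member is "i") and assumes union type ids 0 to 3, which the proposal does not specify. The struct definition and flag value are copied from the C data interface specification; every other name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ARROW_FLAG_NULLABLE 2

struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

/* Everything below is statically allocated, so releasing only marks the
 * schema and its children as released; nothing is freed. */
static void release_static(struct ArrowSchema* schema) {
  for (int64_t i = 0; i < schema->n_children; ++i) {
    struct ArrowSchema* child = schema->children[i];
    if (child->release != NULL) child->release(child);
  }
  schema->release = NULL;
}

/* VALUE_SCHEMA members: int64 "l", uint64 "L", float64 "g", and the
 * column's own type (assumed int32, "i"). */
static struct ArrowSchema value_int64 = {
    .format = "l", .name = "int64",
    .flags = ARROW_FLAG_NULLABLE, .release = release_static};
static struct ArrowSchema value_uint64 = {
    .format = "L", .name = "uint64",
    .flags = ARROW_FLAG_NULLABLE, .release = release_static};
static struct ArrowSchema value_float64 = {
    .format = "g", .name = "float64",
    .flags = ARROW_FLAG_NULLABLE, .release = release_static};
static struct ArrowSchema value_value = {
    .format = "i", .name = "value",
    .flags = ARROW_FLAG_NULLABLE, .release = release_static};
static struct ArrowSchema* value_children[] = {
    &value_int64, &value_uint64, &value_float64, &value_value};

/* Top-level fields; flags = 0 means "not null". */
static struct ArrowSchema key_field = {
    .format = "u", .name = "key", .flags = 0, .release = release_static};
static struct ArrowSchema value_field = {
    .format = "+ud:0,1,2,3", /* dense union, assumed type ids */
    .name = "value", .flags = 0,
    .n_children = 4, .children = value_children,
    .release = release_static};
static struct ArrowSchema is_approximate_field = {
    .format = "b", .name = "is_approximate",
    .flags = 0, .release = release_static};
static struct ArrowSchema* stats_children[] = {
    &key_field, &value_field, &is_approximate_field};

/* The statistics record itself is a struct ("+s") of the three fields. */
static struct ArrowSchema stats_schema = {
    .format = "+s", .name = "", .flags = 0,
    .n_children = 3, .children = stats_children,
    .release = release_static};
```

Spelled this way, the naming concern in the TODO is visible directly: the path to the column-typed member is the "value" child of the "value" field.
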

What do you think about this proposal? Could you share your
comments?


Thanks,
