Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei Thu, 06 Jun 2024 02:27:53 -0700

Hi,

Thanks for sharing your comments. Here is a summary so far:


----

Use cases:

* Optimize query plan: e.g. JOIN for DuckDB

Out of scope:

* Transmit statistics by means other than the C data interface
  Examples:
  * Transmit statistics through Apache Arrow IPC file
  * Transmit statistics through Apache Arrow Flight

Candidate approaches:

1. Pass statistics (encoded as Apache Arrow data) via
   ArrowSchema metadata
   * This embeds the address of the statistics into metadata
   * It avoids using the Apache Arrow IPC format with
     the C data interface
2. Embed statistics (encoded as Apache Arrow data) into
   ArrowSchema metadata
   * This adds statistics to metadata in the Apache Arrow IPC
     format
3. Embed statistics (encoded as JSON) into ArrowArray
   metadata
4. Standardize an Apache Arrow schema for statistics and
   transmit statistics via a separate API call that uses the
   C data interface
5. Use ADBC

----

I think that 4. is the best approach among these candidates.

1. Embedding the statistics address is tricky.
2. Consumers need to parse Apache Arrow IPC format data.
   (The C data interface consumers may not have the
   feature.)
3. This will work but 4. is more generic.
5. ADBC is too large to use only for statistics.
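
To illustrate why 1. is tricky, here is a minimal C sketch of
the address round trip it would need, assuming the base 10
address encoding from the original proposal (quoted below).
The function names are hypothetical and not part of any
Apache Arrow API:

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Only the pointer value matters here, so a forward declaration is enough. */
struct ArrowArray;

/* Producer side (hypothetical helper): write the address of `array` as a
 * base 10 string. Returns the number of characters written, or -1 if `buf`
 * is too small. */
int stats_address_to_string(const struct ArrowArray* array, char* buf,
                            size_t size) {
  assert(buf != NULL && size > 0);
  int n = snprintf(buf, size, "%" PRIuPTR, (uintptr_t)array);
  return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Consumer side (hypothetical helper): parse the base 10 string back into
 * a pointer. `str` must be NUL-terminated; metadata values in the C data
 * interface are length-prefixed, so a consumer has to copy the value into
 * a terminated buffer first. Returns NULL on malformed input. */
struct ArrowArray* stats_address_from_string(const char* str) {
  assert(str != NULL);
  /* Reject signs, whitespace and empty input up front. */
  if (str[0] < '0' || str[0] > '9') return NULL;
  char* end = NULL;
  errno = 0;
  uintmax_t value = strtoumax(str, &end, 10);
  if (errno != 0 || *end != '\0' || value > UINTPTR_MAX) return NULL;
  return (struct ArrowArray*)(uintptr_t)value;
}
```

Even with these checks, the consumer can't verify that the
parsed address still points to a live ArrowArray. It has to
trust that the ArrowSchema that owns the statistics hasn't
been released yet.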

What do you think about this?


If we select 4., we need to standardize an Apache Arrow
schema for statistics. How about the following schema?

----
Metadata:

| Name                       | Value | Comments |
|----------------------------|-------|--------- |
| ARROW::statistics::version | 1.0.0 | (1)      |

(1) This follows semantic versioning.

Fields:

| Name           | Type                  | Comments |
|----------------|-----------------------| -------- |
| column         | utf8                  | (2)      |
| key            | utf8 not null         | (3)      |
| value          | VALUE_SCHEMA not null |          |
| is_approximate | bool not null         | (4)      |

(2) If null, then the statistic applies to the entire table.
    It's for "row_count".
(3) We'll provide pre-defined keys such as "max", "min",
    "byte_width" and "distinct_count" but users can also use
    application specific keys.
(4) If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Name    | Type    |
|---------|---------|
| int64   | int64   |
| uint64  | uint64  |
| float64 | float64 |
| binary  | binary  |

If a column is an int32 column, it uses int64 for
"max"/"min". We don't provide all types here. Users should
use a compatible type (int64 for an int32 column) instead.
----
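
As a rough sketch, the schema above could be expressed with
the C data interface as follows. The struct ArrowSchema
definition and the format strings ("+s", "u", "b",
"+ud:...", "l", "L", "g", "z") come from the C data
interface specification. The StatisticsSchema holder, the
function names, the union type ids 0-3 and the nullable
union members are assumptions for illustration. The version
metadata is omitted to keep the sketch short:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARROW_FLAG_NULLABLE 2

/* As defined by the C data interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

/* Hypothetical holder: the caller owns all storage for the schema tree. */
struct StatisticsSchema {
  struct ArrowSchema root;
  struct ArrowSchema fields[4];
  struct ArrowSchema* field_ptrs[4];
  struct ArrowSchema union_members[4];
  struct ArrowSchema* union_member_ptrs[4];
};

/* Nothing is heap-allocated, so releasing only marks the schema and its
 * children as released. */
static void release_in_place(struct ArrowSchema* schema) {
  for (int64_t i = 0; i < schema->n_children; ++i) {
    struct ArrowSchema* child = schema->children[i];
    if (child->release != NULL) child->release(child);
  }
  schema->release = NULL;
}

static void set_field(struct ArrowSchema* field, const char* name,
                      const char* format, int64_t flags) {
  memset(field, 0, sizeof(*field));
  field->name = name;
  field->format = format;
  field->flags = flags;
  field->release = release_in_place;
}

void statistics_schema_init(struct StatisticsSchema* s) {
  static const char* const member_names[4] = {"int64", "uint64", "float64",
                                              "binary"};
  static const char* const member_formats[4] = {"l", "L", "g", "z"};
  assert(s != NULL);
  /* VALUE_SCHEMA members; marking them nullable is an assumption. */
  for (int i = 0; i < 4; ++i) {
    set_field(&s->union_members[i], member_names[i], member_formats[i],
              ARROW_FLAG_NULLABLE);
    s->union_member_ptrs[i] = &s->union_members[i];
  }
  set_field(&s->fields[0], "column", "u", ARROW_FLAG_NULLABLE);
  set_field(&s->fields[1], "key", "u", 0);
  /* Dense union; the type ids 0,1,2,3 are an assumption. */
  set_field(&s->fields[2], "value", "+ud:0,1,2,3", 0);
  set_field(&s->fields[3], "is_approximate", "b", 0);
  s->fields[2].n_children = 4;
  s->fields[2].children = s->union_member_ptrs;
  for (int i = 0; i < 4; ++i) s->field_ptrs[i] = &s->fields[i];
  /* The statistics themselves are a struct with the four fields above. */
  set_field(&s->root, "", "+s", 0);
  s->root.n_children = 4;
  s->root.children = s->field_ptrs;
}
```

Only "column" is nullable here, matching (2): a null column
means that the statistic applies to the entire table.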


Thanks,
-- 
kou


In <20240522.113708.2023905028549001143....@clear-code.com>
  "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:37:08 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
> 
> We're discussing how to provide statistics through the C
> data interface at:
> https://github.com/apache/arrow/issues/38837
> 
> If you're interested in this feature, could you share your
> comments?
> 
> 
> Motivation:
> 
> We can interchange Apache Arrow data by the C data interface
> in the same process. For example, we can pass Apache Arrow
> data read by Apache Arrow C++ (provider) to DuckDB
> (consumer) through the C data interface.
> 
> A provider may know Apache Arrow data statistics. For
> example, a provider can know statistics when it reads Apache
> Parquet data because Apache Parquet may provide statistics.
> 
> But a consumer can't know statistics that are known by a
> producer because there isn't a standard way to provide
> statistics through the C data interface. If a consumer
> knows the statistics, it can process Apache Arrow data
> faster based on them.
> 
> 
> Proposal:
> 
> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> 
> How about providing statistics as metadata in ArrowSchema?
> 
> We reserve "ARROW" namespace for internal Apache Arrow use:
> 
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> 
>> The ARROW pattern is a reserved namespace for internal
>> Arrow use in the custom_metadata fields. For example,
>> ARROW:extension:name.
> 
> So we can use "ARROW:statistics" for the metadata key.
> 
> We can represent statistics as an ArrowArray like ADBC does.
> 
> Here is an example ArrowSchema that is for a record batch
> that has "int32 column1" and "string column2":
> 
> ArrowSchema {
>   .format = "+siu",
>   .metadata = {
>     /* table-level statistics such as row count */
>     "ARROW:statistics" => ArrowArray*,
>   },
>   .children = {
>     ArrowSchema {
>       .name = "column1",
>       .format = "i",
>       .metadata = {
>         /* column-level statistics such as count distinct */
>         "ARROW:statistics" => ArrowArray*,
>       },
>     },
>     ArrowSchema {
>       .name = "column2",
>       .format = "u",
>       .metadata = {
>         /* column-level statistics such as count distinct */
>         "ARROW:statistics" => ArrowArray*,
>       },
>     },
>   },
> }
> 
> The metadata value (ArrowArray* part) of '"ARROW:statistics"
> => ArrowArray*' is a base 10 string of the address of the
> ArrowArray because we can use only strings for metadata
> values. You can't release the statistics ArrowArray*. (Its
> release is a no-op function.) It follows
> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> semantics. (The base ArrowSchema owns statistics
> ArrowArray*.)
> 
> 
> ArrowArray* for statistics uses the following schema:
> 
> | Field Name     | Field Type                       | Comments |
> |----------------|----------------------------------| -------- |
> | key            | string not null                  | (1)      |
> | value          | `VALUE_SCHEMA` not null          |          |
> | is_approximate | bool not null                    | (2)      |
> 
> 1. We'll provide pre-defined keys such as "max", "min",
>    "byte_width" and "distinct_count" but users can also use
>    application specific keys.
> 
> 2. If true, then the value is approximate or best-effort.
> 
> VALUE_SCHEMA is a dense union with members:
> 
> | Field Name | Field Type                       | Comments |
> |------------|----------------------------------| -------- |
> | int64      | int64                            |          |
> | uint64     | uint64                           |          |
> | float64    | float64                          |          |
> | value      | The same type as the ArrowSchema | (3)      |
> |            | that the statistics belong to.   |          |
> 
> 3. If the ArrowSchema's type is string, this type is also string.
> 
>    TODO: Is "value" a good name? If we refer to it from the
>    top-level statistics schema, we need to use
>    "value.value". It's a bit strange...
> 
> 
> What do you think about this proposal? Could you share your
> comments?
> 
> 
> Thanks,
> -- 
> kou
