Hi Kou,

Thanks for pushing for this!

On 06/06/2024 at 11:27, Sutou Kouhei wrote:
> 4. Standardize Apache Arrow schema for statistics and
>     transmit statistics via separated API call that uses the
>     C data interface
> [...]
>
> I think that 4. is the best approach in these candidates.

I agree.

> If we select 4., we need to standardize Apache Arrow schema
> for statistics. How about the following schema?
>
> ----
> Metadata:
>
> | Name                       | Value | Comments |
> |----------------------------|-------|--------- |
> | ARROW::statistics::version | 1.0.0 | (1)      |

I'm not sure this is useful, but it doesn't hurt.

Nit: this should be "ARROW:statistics:version" for consistency with https://arrow.apache.org/docs/format/Columnar.html#extension-types

> Fields:
>
> | Name           | Type                  | Comments |
> |----------------|-----------------------| -------- |
> | column         | utf8                  | (2)      |
> | key            | utf8 not null         | (3)      |

1. Should the key be something like `dictionary(int32, utf8)` to make the representation more efficient where there are many columns?

2. Should the statistics perhaps be nested as a map type under each column to avoid repeating `column`, or is that overkill?

3. Should there also be room for multi-column statistics (such as cardinality of a given column pair), or is it too complex for now?
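
To make questions 1 and 2 concrete, here is a rough plain-Python sketch (not Arrow code) of the two alternatives. The column names, key names, and the presence of a value per row are assumptions for illustration only; the table quoted above shows just `column` and `key`. The `dictionary_encode` helper is hypothetical and only mimics what a `dictionary(int32, utf8)` type would store.

```python
# Flat layout as in the quoted proposal: one row per (column, key) statistic.
# Column names, key names, and values are made up for illustration.
flat_rows = [
    ("a", "max", 10),
    ("a", "min", 1),
    ("b", "max", 7),
    ("b", "min", 3),
]


def dictionary_encode(values):
    """Return (dictionary, indices): each distinct value is stored once,
    in first-seen order, plus one integer index per original value."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Question 1: dictionary-encode the repeated keys, so each key string is
# stored once and every row only carries a small integer index.
key_dictionary, key_indices = dictionary_encode([key for _, key, _ in flat_rows])

# Question 2: nest the statistics under each column (map-like), so that
# `column` appears once per column instead of once per statistic.
nested = {}
for column, key, value in flat_rows:
    nested.setdefault(column, {})[key] = value
```

With many columns sharing the same handful of keys, the encoded form stores each key string once, while the nested form removes the repeated `column` values at the cost of a more complex schema.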

Regards

Antoine.
