Hi Kou,
Thanks for pushing for this!
On 06/06/2024 at 11:27, Sutou Kouhei wrote:
4. Standardize Apache Arrow schema for statistics and
transmit statistics via separated API call that uses the
C data interface
[...]
I think that 4. is the best approach in these candidates.
I agree.
If we select 4., we need to standardize Apache Arrow schema
for statistics. How about the following schema?
----
Metadata:
| Name                       | Value | Comments |
|----------------------------|-------|----------|
| ARROW::statistics::version | 1.0.0 | (1)      |
I'm not sure this is useful, but it doesn't hurt.
Nit: this should be "ARROW:statistics:version" for consistency with
https://arrow.apache.org/docs/format/Columnar.html#extension-types
Fields:
| Name   | Type          | Comments |
|--------|---------------|----------|
| column | utf8          | (2)      |
| key    | utf8 not null | (3)      |
1. Should the key be something like `dictionary(int32, utf8)` to make
the representation more compact when there are many columns?
2. Should the statistics perhaps be nested as a map type under each
column to avoid repeating `column`, or is that overkill?
3. Should there also be room for multi-column statistics (such as
cardinality of a given column pair), or is it too complex for now?
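
To make questions 1 and 2 a bit more concrete, here is a rough pure-Python sketch. It is not Arrow code and only approximates the layouts: it counts int32 offsets plus string bytes for a plain utf8 array, and int32 indices plus a small dictionary for `dictionary(int32, utf8)`, ignoring validity bitmaps and padding. The key names (`min`, `max`, `null_count`), the column names, and the number of columns are made up for illustration and are not part of the proposed schema.

```python
# Hypothetical statistics keys and column names, for illustration only.
STAT_KEYS = ["min", "max", "null_count"]
COLUMNS = [f"col{i}" for i in range(1000)]

# Flat layout from the proposal: one (column, key) row per statistic.
rows = [(column, key) for column in COLUMNS for key in STAT_KEYS]


def plain_utf8_size(values):
    """Approximate bytes of a utf8 array: int32 offsets plus string data."""
    data = sum(len(v.encode("utf-8")) for v in values)
    offsets = 4 * (len(values) + 1)
    return data + offsets


def dictionary_encode(values):
    """Return (dictionary, indices), mimicking dictionary(int32, utf8)."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def nest_by_column(pairs):
    """Group keys under each column, so each column name is stored once."""
    nested = {}
    for column, key in pairs:
        nested.setdefault(column, []).append(key)
    return nested


# Question 1: plain utf8 keys versus dictionary-encoded keys.
keys = [key for _, key in rows]
dictionary, indices = dictionary_encode(keys)
plain_bytes = plain_utf8_size(keys)
encoded_bytes = plain_utf8_size(dictionary) + 4 * len(indices)

# Question 2: how often each column name appears, flat versus nested.
nested = nest_by_column(rows)
flat_column_entries = len(rows)
nested_column_entries = len(nested)

print(f"keys: plain={plain_bytes} bytes, dictionary={encoded_bytes} bytes")
print(f"column entries: flat={flat_column_entries}, nested={nested_column_entries}")
```

With short keys like these, most of the saving comes from no longer repeating the string data, so the gain would grow with longer key names. The nested variant stores each column name once, at the cost of a more complex schema.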
Regards
Antoine.