Re: [DISCUSS] Statistics through the C data interface

Antoine Pitrou Thu, 06 Jun 2024 02:42:21 -0700


Hi Kou,

Thanks for pushing for this!

On 06/06/2024 at 11:27, Sutou Kouhei wrote:

4. Standardize Apache Arrow schema for statistics and
    transmit statistics via separated API call that uses the
    C data interface

[...]


I think that 4. is the best approach in these candidates.


I agree.

If we select 4., we need to standardize Apache Arrow schema
for statistics. How about the following schema?

----
Metadata:

| Name                       | Value | Comments |
|----------------------------|-------|--------- |
| ARROW::statistics::version | 1.0.0 | (1)      |


I'm not sure this is useful, but it doesn't hurt.

Nit: this should be "ARROW:statistics:version" for consistency with https://arrow.apache.org/docs/format/Columnar.html#extension-types

Fields:

| Name           | Type                  | Comments |
|----------------|-----------------------| -------- |
| column         | utf8                  | (2)      |
| key            | utf8 not null         | (3)      |

1. Should the key be something like `dictionary(int32, utf8)` to make the representation more efficient where there are many columns?

2. Should the statistics perhaps be nested as a map type under each column to avoid repeating `column`, or is that overkill?

3. Should there also be room for multi-column statistics (such as cardinality of a given column pair), or is it too complex for now?
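
To make questions 1 and 2 concrete, here is a minimal pure-Python sketch. It is not Arrow code and not part of the proposal. The column names ("a", "b"), the key names ("max", "min") and the helper functions are hypothetical, chosen only for illustration. It shows how dictionary-encoding `key` stores each distinct string once, and how nesting the keys under each column avoids repeating `column`.

```python
from collections import defaultdict

# Hypothetical flat rows in the proposed (column, key) layout.
# Column names and key names are made up for this sketch.
flat_rows = [
    ("a", "max"),
    ("a", "min"),
    ("b", "max"),
    ("b", "min"),
]


def dictionary_encode(values):
    """Return (dictionary, indices): each distinct value is stored once,
    and every row refers to it by integer position."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def nest_by_column(rows):
    """Group keys under their column so each column name appears once."""
    nested = defaultdict(list)
    for column, key in rows:
        nested[column].append(key)
    return dict(nested)


# Question 1: dictionary-encode the repeated key strings.
key_dictionary, key_indices = dictionary_encode([key for _, key in flat_rows])

# Question 2: nest the statistics under each column.
nested = nest_by_column(flat_rows)

print(key_dictionary, key_indices)
print(nested)
```

With many columns, the flat layout repeats every key string once per column, whereas the dictionary holds each key string once and the per-row cost is one small integer. The nested layout trades that repetition for a more complex type.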


Regards

Antoine.
