Thank you for collecting all of our opinions on this! I also agree that (4) is the best option.

> Fields:
>
> | Name           | Type                  | Comments |
> |----------------|-----------------------| -------- |
> | column         | utf8                  | (2)      |

The utf8 type would presume that column names are unique (although I
like it better than referring to columns by integer position).

> If null, then the statistic applies to the entire table.

Perhaps the NULL column value could also be used for the other
statistics in addition to a row count if the array is not a struct
array?

On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi Kou,
>
> Thanks for pushing for this!
>
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > 4. Standardize Apache Arrow schema for statistics and
> >    transmit statistics via separated API call that uses the
> >    C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > ----
> > Metadata:
> >
> > | Name                       | Value | Comments |
> > |----------------------------|-------|--------- |
> > | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name           | Type                  | Comments |
> > |----------------|-----------------------| -------- |
> > | column         | utf8                  | (2)      |
> > | key            | utf8 not null         | (3)      |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.
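
To make the column-name concern concrete, here is a minimal sketch in
plain Python. It is not Arrow code and not part of the proposal: the
record layout `(column, key, value)`, the sample values, and the helper
names `stats_for` and `is_ambiguous` are all made up for illustration.
It assumes a schema in which two fields share a name, which as far as I
know Arrow schemas do not forbid.

```python
# Hypothetical flat statistics records: (column, key, value).
# A None column stands for "applies to the whole table/array".
records = [
    (None, "row_count", 3),
    ("id", "null_count", 0),
    ("id", "max", 30),
    ("id", "null_count", 2),  # statistic for a second field also named "id"
]

# Illustrative schema with a duplicated field name.
field_names = ["id", "name", "id"]


def stats_for(column):
    """Collect key -> value for one column name.

    Rows are matched by name only, so a later row with the same
    (column, key) silently overwrites an earlier one.
    """
    return {key: value for col, key, value in records if col == column}


def is_ambiguous(column):
    """Return True when the name matches more than one field."""
    return field_names.count(column) > 1


print(stats_for(None))     # statistics attached to the NULL column
print(stats_for("id"))     # the two "id" fields collapse into one dict
print(is_ambiguous("id"))  # name lookup cannot tell the fields apart
```

With name-based lookup the two `null_count` rows for "id" cannot be
attributed to separate fields, which is the uniqueness assumption
mentioned above. The `None` entry shows how a NULL column could carry
statistics other than a row count.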