Re: [DISCUSS] Statistics through the C data interface

Dewey Dunnington Thu, 06 Jun 2024 18:07:01 -0700

Thank you for collecting all of our opinions on this! I also agree
that (4) is the best option.


> Fields:
>
> | Name           | Type                  | Comments |
> |----------------|-----------------------| -------- |
> | column         | utf8                  | (2)      |

The utf8 type would presume that column names are unique (although I
like it better than referring to columns by integer position).
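
To make the uniqueness concern concrete, here is a rough sketch in plain
Python. It is not an Arrow API: `column` and `key` come from the proposed
schema, while `value`, `lookup_by_name`, and the statistic key
`distinct_count` are hypothetical names used only for illustration. It
shows that a lookup keyed on column name cannot tell apart two columns
that share a name:

```python
# Plain-Python stand-in for rows of the proposed statistics array.
# "column" and "key" are from the proposal; "value" and the statistic
# key "distinct_count" are hypothetical, for illustration only.

# Field names in a schema are not required to be unique.
schema_column_names = ["id", "value", "value"]

stats_rows = [
    {"column": "id", "key": "distinct_count", "value": 100},
    {"column": "value", "key": "distinct_count", "value": 7},
    {"column": "value", "key": "distinct_count", "value": 42},
]


def lookup_by_name(rows, name, key):
    """Return every statistic value recorded under a column name and key."""
    return [r["value"] for r in rows if r["column"] == name and r["key"] == key]


# Two rows match "value", so the consumer cannot map them back to a
# specific column without extra information such as position.
ambiguous = lookup_by_name(stats_rows, "value", "distinct_count")
unique = lookup_by_name(stats_rows, "id", "distinct_count")
```

With unique names the lookup returns a single value; with duplicates the
consumer would need some other convention (for example, row order) to
disambiguate.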

> If null, then the statistic applies to the entire table.

Perhaps the NULL column value could also be used for the other
statistics in addition to a row count if the array is not a struct
array?
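
As a sketch of what I mean, again in plain Python rather than Arrow
itself: `None` stands in for a NULL `column` value, and the statistic
keys (`row_count`, `null_count`, `max`), the `value` field, and the
helper `stats_for` are all hypothetical. For a non-struct array, every
statistic would simply be recorded with a NULL column:

```python
# Plain-Python stand-in for statistics describing a single non-struct
# array (for example, one int32 array). None models a NULL "column".
# All statistic keys and the "value" field are hypothetical.
array_stats_rows = [
    {"column": None, "key": "row_count", "value": 1000},
    {"column": None, "key": "null_count", "value": 12},
    {"column": None, "key": "max", "value": 99},
]


def stats_for(rows, column):
    """Collect the statistics recorded for one column (None = whole array)."""
    return {r["key"]: r["value"] for r in rows if r["column"] == column}


# With no child columns to name, everything lives under the NULL column.
whole_array = stats_for(array_stats_rows, None)
```

Here the NULL column carries more than a row count, which is the
extension I am asking about.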


On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi Kou,
>
> Thanks for pushing for this!
>
> Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > 4. Standardize Apache Arrow schema for statistics and
> >     transmit statistics via separated API call that uses the
> >     C data interface
> [...]
> >
> > I think that 4. is the best approach in these candidates.
>
> I agree.
>
> > If we select 4., we need to standardize Apache Arrow schema
> > for statistics. How about the following schema?
> >
> > ----
> > Metadata:
> >
> > | Name                       | Value | Comments |
> > |----------------------------|-------|--------- |
> > | ARROW::statistics::version | 1.0.0 | (1)      |
>
> I'm not sure this is useful, but it doesn't hurt.
>
> Nit: this should be "ARROW:statistics:version" for consistency with
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
> > Fields:
> >
> > | Name           | Type                  | Comments |
> > |----------------|-----------------------| -------- |
> > | column         | utf8                  | (2)      |
> > | key            | utf8 not null         | (3)      |
>
> 1. Should the key be something like `dictionary(int32, utf8)` to make
> the representation more efficient where there are many columns?
>
> 2. Should the statistics perhaps be nested as a map type under each
> column to avoid repeating `column`, or is that overkill?
>
> 3. Should there also be room for multi-column statistics (such as
> cardinality of a given column pair), or is it too complex for now?
>
> Regards
>
> Antoine.
