Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei Sat, 08 Jun 2024 23:44:03 -0700

Hi,

>> | Name           | Type                  | Comments |
>> |----------------|-----------------------| -------- |
>> | column         | utf8                  | (2)      |
> 
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by integer position).


Ah, I forgot about that. This proposal is based on ADBC's. ADBC
can presume unique column names because that holds for RDBMSs.
But we can't presume it with the Apache Arrow columnar format.

So we need to use position.

Or we can use "FieldRef" in the C++ implementation:
https://github.com/apache/arrow/blob/399408cb273c47f490f65cdad95bc184a652826c/cpp/src/arrow/type.h#L2038-L2055

But FieldRef is complex to use, so position is better.
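
To illustrate the ambiguity, here is a minimal Python sketch. It is
not part of the proposal: `fields`, `null_counts`, `stats_by_name`, and
`stats_by_position` are hypothetical names and the numbers are made up.
Two fields that share a name collide when statistics are keyed by
name, while positions stay unique:

```python
# Hypothetical field list: Arrow does not require unique field names.
fields = ["id", "value", "value"]
# Made-up per-column statistic (e.g. a null count), one per field.
null_counts = [0, 3, 7]

# Keyed by name: the second "value" silently overwrites the first.
stats_by_name = {}
for name, null_count in zip(fields, null_counts):
    stats_by_name[name] = null_count

# Keyed by position: every column keeps its own entry.
stats_by_position = dict(enumerate(null_counts))
```

Only two entries survive in `stats_by_name`, but all three survive in
`stats_by_position`.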

>> If null, then the statistic applies to the entire table.
> 
> Perhaps the NULL column value could also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?

I hadn't thought of that. Thanks, it makes sense.
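
A rough Python sketch of how a consumer might read this, assuming
records of (column, key, value) where `None` stands in for a NULL
`column`. The key names `row_count` and `null_count` and the helper
`whole_array_stats` are hypothetical, not something the proposal
defines:

```python
# Hypothetical statistics for a struct array (table): column is a position.
table_records = [
    (None, "row_count", 100),  # NULL column: applies to the entire table
    (0, "null_count", 4),      # applies to column 0 only
]

# Hypothetical statistics for a non-struct array: there are no child
# columns, so every statistic uses a NULL column.
array_records = [
    (None, "row_count", 100),
    (None, "null_count", 9),
]

def whole_array_stats(records):
    """Collect the statistics whose column is NULL (None here)."""
    return {key: value for column, key, value in records if column is None}
```

With these inputs, `whole_array_stats(table_records)` returns only the
row count, while `whole_array_stats(array_records)` returns both
statistics.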


Thanks,
-- 
kou

In <CAFb7qScs2msp5EMOxv1J=f9cjfriujb+zdvb5-teah+wyfr...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 22:06:41 -0300,
  Dewey Dunnington <de...@voltrondata.com.INVALID> wrote:

> Thank you for collecting all of our opinions on this! I also agree
> that (4) is the best option.
> 
>> Fields:
>>
>> | Name           | Type                  | Comments |
>> |----------------|-----------------------| -------- |
>> | column         | utf8                  | (2)      |
> 
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by integer position).
> 
>> If null, then the statistic applies to the entire table.
> 
> Perhaps the NULL column value could also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?
> 
> 
> On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>>
>> Hi Kou,
>>
>> Thanks for pushing for this!
>>
>> On 06/06/2024 at 11:27, Sutou Kouhei wrote:
>> > 4. Standardize Apache Arrow schema for statistics and
>> >     transmit statistics via separated API call that uses the
>> >     C data interface
>> [...]
>> >
>> > I think that 4. is the best approach in these candidates.
>>
>> I agree.
>>
>> > If we select 4., we need to standardize Apache Arrow schema
>> > for statistics. How about the following schema?
>> >
>> > ----
>> > Metadata:
>> >
>> > | Name                       | Value | Comments |
>> > |----------------------------|-------|--------- |
>> > | ARROW::statistics::version | 1.0.0 | (1)      |
>>
>> I'm not sure this is useful, but it doesn't hurt.
>>
>> Nit: this should be "ARROW:statistics:version" for consistency with
>> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>>
>> > Fields:
>> >
>> > | Name           | Type                  | Comments |
>> > |----------------|-----------------------| -------- |
>> > | column         | utf8                  | (2)      |
>> > | key            | utf8 not null         | (3)      |
>>
>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>> the representation more efficient where there are many columns?
>>
>> 2. Should the statistics perhaps be nested as a map type under each
>> column to avoid repeating `column`, or is that overkill?
>>
>> 3. Should there also be room for multi-column statistics (such as
>> cardinality of a given column pair), or is it too complex for now?
>>
>> Regards
>>
>> Antoine.
