Hi,

> The exact types inside the dense_union would be chosen when encoding.

Ah, this approach doesn't standardize VALUE_SCHEMA (i.e. it
doesn't use a fixed VALUE_SCHEMA). If it works in the real
world, it's more flexible.
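
For example (just an illustration, not part of the proposal):
two producers could encode the same stat kind with different
union member types, and a consumer would need to handle both:

  // Producer A: ARROW_STAT_NULL_COUNT_EXACT value as int64
  map<int32, dense_union<int64>>

  // Producer B: ARROW_STAT_NULL_COUNT_EXACT value as int32
  map<int32, dense_union<int32>>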

>      Version markers in two-sided protocols never work well long term:
> see Parquet files lying about the version of the encoder so the files
> can be read and web browsers lying in their User-Agent strings so
> websites don't break.

Could you explain this more? I can understand the latter
case, but are there compatibility problems caused by version
information in Parquet files?


>                       It's better to allow probing for individual
> feature support (in this case, the presence of a specific stat kind in
> the array).

It will work for the statistics kind case, but does it work
for adding support for the multi-column statistics case?
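
For the statistics kind case, probing could look like this
(a minimal sketch over the proposed inner
map<int32, dense_union<...>> using only the raw C data
interface; no error handling, just an illustration):

  #include <stdbool.h>
  #include <stdint.h>

  #include "arrow/c/abi.h"  /* struct ArrowArray */

  /* Does any entry of the inner map<int32, dense_union<...>> carry
     the given stat kind? A map is physically list<struct<key, value>>:
     the map array has one child (the entries struct) whose first
     child holds the int32 keys. This scans the flattened key child
     (the keys of all entries), which is enough for a simple presence
     probe. */
  static bool
  stats_has_kind(const struct ArrowArray *inner_map, int32_t kind)
  {
    const struct ArrowArray *entries = inner_map->children[0];
    const struct ArrowArray *keys = entries->children[0];
    /* buffers[0] is validity (map keys are non-nullable),
       buffers[1] is the int32 data */
    const int32_t *values = (const int32_t *)keys->buffers[1];
    for (int64_t i = 0; i < keys->length; i++) {
      if (values[keys->offset + i] == kind) {
        return true;
      }
    }
    return false;
  }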

I think that we can't mix

map<
  // the column index or null if the statistics refer to the whole table or batch
  column: int32,
  map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>

and

map<
  // the column indexes or empty if the statistics refer to the whole table or batch
  column: list<int32>,
  map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>

in one standardized schema.
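
The structural difference is visible in the C data interface
schema, so a consumer would have to branch on the key layout
up front; for illustration (a sketch assuming the two layouts
above):

  #include <string.h>

  #include "arrow/c/abi.h"  /* struct ArrowSchema */

  /* Which "column" key layout did we receive?
     1: int32 (format "i"),
     2: list<int32> (format "+l" with an "i" child),
     0: anything else. */
  static int
  stats_column_key_kind(const struct ArrowSchema *stats_map)
  {
    const struct ArrowSchema *entries = stats_map->children[0];
    const struct ArrowSchema *key = entries->children[0];
    if (strcmp(key->format, "i") == 0) {
      return 1;
    }
    if (strcmp(key->format, "+l") == 0 &&
        strcmp(key->children[0]->format, "i") == 0) {
      return 2;
    }
    return 0;
  }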


Thanks,
-- 
kou

In <CAOC8YXZSE-b_zMTbpfDi_K+U3r6H19VT1Z_CuO8hGsBKx8Qf=w...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 
23:27:18 -0300,
  Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:

> I've been thinking about how to encode statistics on Arrow arrays and
> how to keep the set of statistics known by both producers and
> consumers (i.e. standardized).
> 
> The statistics array(s) could be a
> 
>   map<
>     // the column index or null if the statistics refer to the whole table or batch
>     column: int32,
>     map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>   >
> 
> The keys would be defined as part of the standard:
> 
> // Statistics values are identified by specified int32-valued keys
> // so that producers and consumers can agree on physical
> // encoding and semantics. Statistics can be about a column,
> // a record batch, or both.
> typedef ArrowStatKind int32_t;
> 
> #define ARROW_STAT_ANY 0
> // Exact number of nulls in a column. Value must be int32 or int64.
> #define ARROW_STAT_NULL_COUNT_EXACT 1
> // Approximate number of nulls in a column. Value must be float32 or float64.
> #define ARROW_STAT_NULL_COUNT_APPROX 2
> // The minimum and maximum values of a column.
> // Value must be the same type as the column.
> // Supported types are: ...
> #define ARROW_STAT_MIN_APPROX 3
> #define ARROW_STAT_MIN_NULLS_FIRST 4
> #define ARROW_STAT_MIN_NULLS_LAST 5
> #define ARROW_STAT_MAX_APPROX 6
> #define ARROW_STAT_MAX_NULLS_FIRST 7
> #define ARROW_STAT_MAX_NULLS_LAST 8
> #define ARROW_STAT_CARDINALITY_APPROX 9
> #define ARROW_STAT_COUNT_DISTINCT_APPROX 10
> 
> Every key is optional and consumers that don't know or don't care
> about the stats can skip them while scanning statistics arrays.
> 
> Applications would have their own domain classes for storing
> statistics (e.g. DuckDB's BaseStatistics [1]) and a way to pack and
> unpack into these arrays.
> 
> The exact types inside the dense_union would be chosen when encoding.
> The decoder would handle the types expected and/or supported for each
> given stat kind.
> 
> We wouldn't have to rely on versioning of the entire statistics
> objects. If we want a richer way to represent a maximum, we add
> another stat kind to the spec and keep producing both the old and the
> new representations for the maximum while consumers migrate to the new
> way. Version markers in two-sided protocols never work well long term:
> see Parquet files lying about the version of the encoder so the files
> can be read and web browsers lying in their User-Agent strings so
> websites don't break. It's better to allow probing for individual
> feature support (in this case, the presence of a specific stat kind in
> the array).
> 
> Multiple calls could be made to load statistics, and each call could
> bring more statistics.
> 
> --
> Felipe
> 
> [1] 
> https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
> 
> On Thu, Jun 6, 2024 at 10:07 PM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
>>
>> Thank you for collecting all of our opinions on this! I also agree
>> that (4) is the best option.
>>
>> > Fields:
>> >
>> > | Name           | Type                  | Comments |
>> > |----------------|-----------------------| -------- |
>> > | column         | utf8                  | (2)      |
>>
>> The utf8 type would presume that column names are unique (although I
>> like it better than referring to columns by integer position).
>>
>> > If null, then the statistic applies to the entire table.
>>
>> Perhaps the NULL column value could also be used for the other
>> statistics in addition to a row count if the array is not a struct
>> array?
>>
>>
>> On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou <anto...@python.org> wrote:
>> >
>> >
>> > Hi Kou,
>> >
>> > Thanks for pushing for this!
>> >
>> > On 06/06/2024 11:27, Sutou Kouhei wrote:
>> > > 4. Standardize Apache Arrow schema for statistics and
>> > >     transmit statistics via separated API call that uses the
>> > >     C data interface
>> > [...]
>> > >
>> > > I think that 4. is the best approach in these candidates.
>> >
>> > I agree.
>> >
>> > > If we select 4., we need to standardize Apache Arrow schema
>> > > for statistics. How about the following schema?
>> > >
>> > > ----
>> > > Metadata:
>> > >
>> > > | Name                       | Value | Comments |
>> > > |----------------------------|-------|--------- |
>> > > | ARROW::statistics::version | 1.0.0 | (1)      |
>> >
>> > I'm not sure this is useful, but it doesn't hurt.
>> >
>> > Nit: this should be "ARROW:statistics:version" for consistency with
>> > https://arrow.apache.org/docs/format/Columnar.html#extension-types
>> >
>> > > Fields:
>> > >
>> > > | Name           | Type                  | Comments |
>> > > |----------------|-----------------------| -------- |
>> > > | column         | utf8                  | (2)      |
>> > > | key            | utf8 not null         | (3)      |
>> >
>> > 1. Should the key be something like `dictionary(int32, utf8)` to make
>> > the representation more efficient where there are many columns?
>> >
>> > 2. Should the statistics perhaps be nested as a map type under each
>> > column to avoid repeating `column`, or is that overkill?
>> >
>> > 3. Should there also be room for multi-column statistics (such as
>> > cardinality of a given column pair), or is it too complex for now?
>> >
>> > Regards
>> >
>> > Antoine.
