Hi,

> The exact types inside the dense_union would be chosen when encoding.
Ah, this approach doesn't standardize VALUE_SCHEMA (doesn't use a
fixed VALUE_SCHEMA). If it works in the real world, it's more
flexible.

> Version markers in two-sided protocols never work well long term:
> see Parquet files lying about the version of the encoder so the files
> can be read and web browsers lying on their User-Agent strings so
> websites don't break.

Could you explain more? I can understand the latter case, but are
there version-information-based compatibility problems in Parquet
files?

> It's better to allow probing for individual
> feature support (in this case, the presence of a specific stat kind in
> the array).

It will work for the statistics kind case, but does it work for
adding support for the multi-column statistics case? I think that we
can't mix

map<
  // the column index or null if the statistics refer to whole table or batch
  column: int32,
  map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>

and

map<
  // the column indexes or empty if the statistics refer to whole table or batch
  column: list<int32>,
  map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>
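For example, here is a minimal pyarrow sketch of the two schemas (the
dense_union child types and the variable names are illustrative, not
part of the proposal):

  import pyarrow as pa

  # The value side is the same in both variants: a map from stat kind
  # (int32) to a dense_union of whatever types the producer encoded.
  stat_value = pa.dense_union([
      pa.field("int64", pa.int64()),
      pa.field("float64", pa.float64()),
  ])
  per_column_stats = pa.map_(pa.int32(), stat_value)

  # Single-column statistics: the top-level key is one column index.
  single_column = pa.map_(pa.int32(), per_column_stats)

  # Multi-column statistics: the top-level key is a list of column
  # indexes.
  multi_column = pa.map_(pa.list_(pa.int32()), per_column_stats)

The two differ in the top-level key type (int32 vs. list<int32>), so
a consumer can't probe for multi-column support entry by entry the
way it can probe for a new stat kind; the producer has to commit to
one schema up front.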
Thanks,
--
kou

In <CAOC8YXZSE-b_zMTbpfDi_K+U3r6H19VT1Z_CuO8hGsBKx8Qf=w...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 6 Jun 2024 23:27:18 -0300,
  Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:

> I've been thinking about how to encode statistics on Arrow arrays and
> how to keep the set of statistics known by both producers and
> consumers (i.e. standardized).
>
> The statistics array(s) could be a
>
> map<
>   // the column index or null if the statistics refer to whole table or batch
>   column: int32,
>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
> >
>
> The keys would be defined as part of the standard:
>
> // Statistics values are identified by specified int32-valued keys
> // so that producers and consumers can agree on physical
> // encoding and semantics. Statistics can be about a column,
> // a record batch, or both.
> typedef int32_t ArrowStatKind;
>
> #define ARROW_STAT_ANY 0
> // Exact number of nulls in a column. Value must be int32 or int64.
> #define ARROW_STAT_NULL_COUNT_EXACT 1
> // Approximate number of nulls in a column. Value must be float32 or float64.
> #define ARROW_STAT_NULL_COUNT_APPROX 2
> // The minimum and maximum values of a column.
> // Value must be the same type as the column.
> // Supported types are: ...
> #define ARROW_STAT_MIN_APPROX 3
> #define ARROW_STAT_MIN_NULLS_FIRST 4
> #define ARROW_STAT_MIN_NULLS_LAST 5
> #define ARROW_STAT_MAX_APPROX 6
> #define ARROW_STAT_MAX_NULLS_FIRST 7
> #define ARROW_STAT_MAX_NULLS_LAST 8
> #define ARROW_STAT_CARDINALITY_APPROX 9
> #define ARROW_STAT_COUNT_DISTINCT_APPROX 10
>
> Every key is optional and consumers that don't know or don't care
> about the stats can skip them while scanning statistics arrays.
>
> Applications would have their own domain classes for storing
> statistics (e.g. DuckDB's BaseStatistics [1]) and a way to pack and
> unpack into these arrays.
>
> The exact types inside the dense_union would be chosen when encoding.
> The decoder would handle the types expected and/or supported for each
> given stat kind.
>
> We wouldn't have to rely on versioning of the entire statistics
> objects. If we want a richer way to represent a maximum, we add
> another stat kind to the spec and keep producing both the old and the
> new representations for the maximum while consumers migrate to the
> new way.
>
> Version markers in two-sided protocols never work well long term:
> see Parquet files lying about the version of the encoder so the files
> can be read and web browsers lying on their User-Agent strings so
> websites don't break. It's better to allow probing for individual
> feature support (in this case, the presence of a specific stat kind in
> the array).
>
> Multiple calls could be done to load statistics and they could come
> with more statistics each time.
>
> --
> Felipe
>
> [1] https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146
>
> On Thu, Jun 6, 2024 at 10:07 PM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
>>
>> Thank you for collecting all of our opinions on this! I also agree
>> that (4) is the best option.
>>
>> > Fields:
>> >
>> > | Name   | Type          | Comments |
>> > |--------|---------------|----------|
>> > | column | utf8          | (2)      |
>>
>> The utf8 type would presume that column names are unique (although I
>> like it better than referring to columns by integer position).
>>
>> > If null, then the statistic applies to the entire table.
>>
>> Perhaps the NULL column value could also be used for the other
>> statistics in addition to a row count if the array is not a struct
>> array?
>>
>> On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou <anto...@python.org> wrote:
>> >
>> > Hi Kou,
>> >
>> > Thanks for pushing for this!
>> >
>> > Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
>> > > 4. Standardize Apache Arrow schema for statistics and
>> > >    transmit statistics via separated API call that uses the
>> > >    C data interface
>> > [...]
>> > >
>> > > I think that 4. is the best approach in these candidates.
>> >
>> > I agree.
>> >
>> > > If we select 4., we need to standardize Apache Arrow schema
>> > > for statistics. How about the following schema?
>> > >
>> > > ----
>> > > Metadata:
>> > >
>> > > | Name                       | Value | Comments |
>> > > |----------------------------|-------|----------|
>> > > | ARROW::statistics::version | 1.0.0 | (1)      |
>> >
>> > I'm not sure this is useful, but it doesn't hurt.
>> >
>> > Nit: this should be "ARROW:statistics:version" for consistency with
>> > https://arrow.apache.org/docs/format/Columnar.html#extension-types
>> >
>> > > Fields:
>> > >
>> > > | Name   | Type          | Comments |
>> > > |--------|---------------|----------|
>> > > | column | utf8          | (2)      |
>> > > | key    | utf8 not null | (3)      |
>> >
>> > 1. Should the key be something like `dictionary(int32, utf8)` to make
>> >    the representation more efficient where there are many columns?
>> >
>> > 2. Should the statistics perhaps be nested as a map type under each
>> >    column to avoid repeating `column`, or is that overkill?
>> >
>> > 3. Should there also be room for multi-column statistics (such as
>> >    cardinality of a given column pair), or is it too complex for now?
>> >
>> > Regards
>> >
>> > Antoine.
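P.S. To make Antoine's suggestions (1) and (2) concrete, here is a
pyarrow sketch. Only the "column" and "key" rows of the field table
are quoted above, so the "value" field below is a placeholder, not
the proposed type:

  import pyarrow as pa

  # Placeholder value type; the real proposal would standardize this.
  value_type = pa.dense_union([
      pa.field("int64", pa.int64()),
      pa.field("float64", pa.float64()),
  ])

  # Suggestion (1): dictionary-encode the key so that repeated key
  # strings are stored once even when there are many columns.
  key_type = pa.dictionary(pa.int32(), pa.utf8())

  # The flat schema: one row per (column, key, value).
  flat = pa.schema([
      pa.field("column", pa.utf8()),             # null = whole table
      pa.field("key", key_type, nullable=False),
      pa.field("value", value_type),
  ])

  # Suggestion (2): nest the statistics as a map under each column so
  # that "column" isn't repeated for every statistic.
  nested = pa.schema([
      pa.field("column", pa.utf8()),
      pa.field("statistics", pa.map_(pa.utf8(), value_type)),
  ])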