I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both producers and consumers (i.e. standardized).
The statistics array(s) could be a

  map<
    // The column index, or null if the statistics
    // refer to the whole table or record batch.
    column: int32,
    map<int32, dense_union<...needed types based on stat kinds in the keys...>>
  >

The keys would be defined as part of the standard:

  // Statistics values are identified by specified int32-valued keys
  // so that producers and consumers can agree on physical
  // encoding and semantics. Statistics can be about a column,
  // a record batch, or both.
  typedef int32_t ArrowStatKind;

  #define ARROW_STAT_ANY 0
  // Exact number of nulls in a column. Value must be int32 or int64.
  #define ARROW_STAT_NULL_COUNT_EXACT 1
  // Approximate number of nulls in a column. Value must be float32 or float64.
  #define ARROW_STAT_NULL_COUNT_APPROX 2
  // The minimum and maximum values of a column.
  // Value must be the same type as the column.
  // Supported types are: ...
  #define ARROW_STAT_MIN_APPROX 3
  #define ARROW_STAT_MIN_NULLS_FIRST 4
  #define ARROW_STAT_MIN_NULLS_LAST 5
  #define ARROW_STAT_MAX_APPROX 6
  #define ARROW_STAT_MAX_NULLS_FIRST 7
  #define ARROW_STAT_MAX_NULLS_LAST 8
  #define ARROW_STAT_CARDINALITY_APPROX 9
  #define ARROW_STAT_COUNT_DISTINCT_APPROX 10

Every key is optional, and consumers that don't know or don't care about a given stat can skip it while scanning the statistics arrays. Applications would keep their own domain classes for storing statistics (e.g. DuckDB's BaseStatistics [1]) plus a way to pack and unpack them into these arrays. The exact types inside the dense_union would be chosen at encoding time; the decoder would handle the types expected and/or supported for each stat kind.

This way we wouldn't have to rely on versioning of the entire statistics object. If we want a richer way to represent a maximum, we add another stat kind to the spec and keep producing both the old and the new representations of the maximum while consumers migrate to the new one.
Version markers in two-sided protocols never work well long term: see Parquet files lying about the version of the encoder so the files can be read, and web browsers lying in their User-Agent strings so websites don't break. It's better to allow probing for individual feature support (in this case, the presence of a specific stat kind in the array). Multiple calls could be made to load statistics, and each call could return more statistics than the previous one.

--
Felipe

[1] https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146

On Thu, Jun 6, 2024 at 10:07 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:
>
> Thank you for collecting all of our opinions on this! I also agree
> that (4) is the best option.
>
> > Fields:
> >
> > | Name           | Type          | Comments |
> > |----------------|---------------|----------|
> > | column         | utf8          | (2)      |
>
> The utf8 type would presume that column names are unique (although I
> like it better than referring to columns by integer position).
>
> > If null, then the statistic applies to the entire table.
>
> Perhaps the NULL column value could also be used for the other
> statistics in addition to a row count if the array is not a struct
> array?
>
> On Thu, Jun 6, 2024 at 6:42 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Hi Kou,
> >
> > Thanks for pushing for this!
> >
> > Le 06/06/2024 à 11:27, Sutou Kouhei a écrit :
> > > 4. Standardize Apache Arrow schema for statistics and
> > >    transmit statistics via separated API call that uses the
> > >    C data interface
> > [...]
> > >
> > > I think that 4. is the best approach in these candidates.
> >
> > I agree.
> >
> > > If we select 4., we need to standardize Apache Arrow schema
> > > for statistics. How about the following schema?
> > >
> > > ----
> > > Metadata:
> > >
> > > | Name                       | Value | Comments |
> > > |----------------------------|-------|----------|
> > > | ARROW::statistics::version | 1.0.0 | (1)      |
> >
> > I'm not sure this is useful, but it doesn't hurt.
> >
> > Nit: this should be "ARROW:statistics:version" for consistency with
> > https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> > > Fields:
> > >
> > > | Name   | Type          | Comments |
> > > |--------|---------------|----------|
> > > | column | utf8          | (2)      |
> > > | key    | utf8 not null | (3)      |
> >
> > 1. Should the key be something like `dictionary(int32, utf8)` to make
> >    the representation more efficient where there are many columns?
> >
> > 2. Should the statistics perhaps be nested as a map type under each
> >    column to avoid repeating `column`, or is that overkill?
> >
> > 3. Should there also be room for multi-column statistics (such as
> >    cardinality of a given column pair), or is it too complex for now?
> >
> > Regards
> >
> > Antoine.