Hi,

If nobody objects to using utf8 or dictionary<int32, utf8> for
statistics keys, let's use dictionary<int32, utf8>, because
dictionary<int32, utf8> will be more efficient than utf8 when
there are many columns.
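
For example, here is a minimal Apache Arrow C++ sketch (not the final
implementation) of building dictionary<int32, utf8> statistics keys.
The key strings and index values below are placeholders:

#include <arrow/api.h>

arrow::Result<std::shared_ptr<arrow::Array>> BuildStatisticsKeys() {
  // Each key string is stored only once in the dictionary even when
  // it's repeated for many columns.
  arrow::StringBuilder dictionary_builder;
  ARROW_RETURN_NOT_OK(
      dictionary_builder.AppendValues({"max", "min", "distinct_count"}));
  ARROW_ASSIGN_OR_RAISE(auto dictionary, dictionary_builder.Finish());

  // One index per (column, statistic) pair, referring to the
  // dictionary above.
  arrow::Int32Builder indices_builder;
  ARROW_RETURN_NOT_OK(indices_builder.AppendValues({0, 1, 0, 1, 2}));
  ARROW_ASSIGN_OR_RAISE(auto indices, indices_builder.Finish());

  return arrow::DictionaryArray::FromArrays(
      arrow::dictionary(arrow::int32(), arrow::utf8()), indices, dictionary);
}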

I'll start writing documentation for this and implementing it
for C++ next week. I'll share a PR when both are ready. We can
start a vote on this after we review the PR.


Thanks,
-- 
kou

In <20240712.151536.312169170508271330....@clear-code.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul 2024 
15:15:36 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
> 
>>> map<struct<int32, utf8>,
>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>> 
>> Yes. That's my suggestion. And to leverage the fact that libraries
>> handle unions gracefully, this could be:
>> 
>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds
>> in the keys...>>
>> 
>> X is either sparse or dense.
>> 
>> A possible alternative is to use a custom struct instead of map and reduce
>> the levels of nesting:
>> 
>> struct<int32, utf8, dense_union<...needed types based on the keys...>>
> 
> Thanks for clarifying your suggestion.
> 
> If we need utf8 for non-standard statistics, I think that
> map<utf8, ...> or map<dictionary<int32, utf8>, ...> is
> better, as Antoine said, because they are simpler than
> int32+utf8.
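>
> For example, with map<dictionary<int32, utf8>, ...>, a non-standard
> statistic needs only one vendor-prefixed key (the key name below is
> a hypothetical example):
>
>   "vendor1:compressed_size" -> value
>
> while the int32+utf8 form needs two fields for the same information:
>
>   (0, "compressed_size") -> value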
> 
> 
> Thanks,
> -- 
> kou
> 
> In <caoc8yxy_e9w5sky572phktt-tdguc9v3xpxiom1xohp8rq7...@mail.gmail.com>
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul 2024 
> 14:17:46 -0300,
>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
> 
>> On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote:
>> 
>>> Hi,
>>>
>>> > for non-standard statistics from open-source products the key=0
>>> > combined with string label is the way to go
>>>
>>> Where do we store the string label?
>>>
>>> I think that we're considering the following schema:
>>>
>>> >> map<
>>> >>   // The column index or null if the statistics refer to the
>>> >>   // whole table or batch.
>>> >>   column: int32,
>>> >>   // Statistics key is int32.
>>> >>   // Different keys are assigned for exact value and
>>> >>   // approximate value.
>>> >>   map<int32, dense_union<...needed types based on stat kinds
>>> >>   in the keys...>>
>>> >> >
>>>
>>> Are you considering the following schema for the key=0 case?
>>>
>>> map<struct<int32, utf8>,
>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>> 
>> Yes. That's my suggestion. And to leverage the fact that libraries
>> handle unions gracefully, this could be:
>> 
>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds
>> in the keys...>>
>> 
>> X is either sparse or dense.
>> 
>> A possible alternative is to use a custom struct instead of map and reduce
>> the levels of nesting:
>> 
>> struct<int32, utf8, dense_union<...needed types based on the keys...>>
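>>
>> For example, one entry in such a flattened layout could look like
>> this (the field names are hypothetical):
>>
>> struct<
>>   key: int32,    // standard statistic key, or the special key
>>   label: utf8,   // string identifier for non-standard statistics
>>   value: dense_union<...needed types based on the keys...>
>> >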
>> 
>> --
>> Felipe
>> 
>> 
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com>
>>>   "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul
>>> 2024 11:58:44 -0300,
>>>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > You can promise that well-known int32 statistic keys won't ever be
>>> > higher than a certain value (2^18) [1] like TCP/IP ports (well-known
>>> > ports in [0, 2^10)) but for non-standard statistics from open-source
>>> > products the key=0 combined with string label is the way to go,
>>> > otherwise collisions would be inevitable and everyone would hate us
>>> > for having integer keys.
>>> >
>>> > This is not a very weird proposal on my part because integer keys
>>> > representing labels are common in most low-level standardized C APIs
>>> > (e.g. Linux syscalls, ioctls, OpenGL, Vulkan...). I expect higher
>>> > level APIs to map all these keys to strings, but with them we keep
>>> > the "C Data Interface" low-level and portable as it should be.
>>> >
>>> > --
>>> > Felipe
>>> >
>>> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums
>>> > because C limits enum to 2^16 and that is *not enough*.
>>> >
>>> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> Here is an updated summary so far:
>>> >>
>>> >> ----
>>> >> Use cases:
>>> >>
>>> >> * Optimize query plan: e.g. JOIN for DuckDB
>>> >>
>>> >> Out of scope:
>>> >>
>>> >> * Transmitting statistics through anything other than the C data
>>> >>   interface
>>> >>   Examples:
>>> >>   * Transmit statistics through Apache Arrow IPC file
>>> >>   * Transmit statistics through Apache Arrow Flight
>>> >> * Multi-column statistics
>>> >> * Constraints information
>>> >> * Indexes information
>>> >>
>>> >> Discussing approach:
>>> >>
>>> >> Standardize an Apache Arrow schema for statistics and transmit
>>> >> statistics via a separate API call that uses the C data
>>> >> interface.
>>> >>
>>> >> This also works for per-batch statistics.
>>> >>
>>> >> Candidate schema:
>>> >>
>>> >> map<
>>> >>   // The column index or null if the statistics refer to the
>>> >>   // whole table or batch.
>>> >>   column: int32,
>>> >>   // Statistics key is int32.
>>> >>   // Different keys are assigned for exact value and
>>> >>   // approximate value.
>>> >>   map<int32, dense_union<...needed types based on stat kinds
>>> >>   in the keys...>>
>>> >> >
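>>> >>
>>> >> For example, with hypothetical int32 key assignments (1 = exact
>>> >> row count, 2 = exact max), statistics for a batch with one int32
>>> >> column could be encoded as:
>>> >>
>>> >>   null -> {1: int64(29)}   // statistics for the whole table/batch
>>> >>   0    -> {2: int64(100)}  // statistics for column 0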
>>> >>
>>> >> Discussions:
>>> >>
>>> >> 1. Can we use int32 for statistic keys?
>>> >>    Should we use utf8 (or dictionary<int32, utf8>) for
>>> >>    statistic keys?
>>> >> 2. How to support non-standard (vendor-specific)
>>> >>    statistic keys?
>>> >> ----
>>> >>
>>> >> Here is my idea:
>>> >>
>>> >> 1. We can use int32 for statistic keys.
>>> >> 2. We can reserve a specific range for non-standard
>>> >>    statistic keys. Prerequisites of this:
>>> >>    * There is no use case to merge some statistics for
>>> >>      the same data.
>>> >>    * We can't merge statistics for different data.
>>> >>
>>> >> If the prerequisites aren't satisfied:
>>> >>
>>> >> 1. We should use utf8 (or dictionary<int32, utf8>) for
>>> >>    statistic keys.
>>> >> 2. We can use a reserved prefix such as "ARROW:"/"arrow." for
>>> >>    standard statistic keys and a prefix such as
>>> >>    "vendor1:"/"vendor1." for non-standard statistic keys.
>>> >>
>>> >> Here is Felipe's idea:
>>> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
>>> >>
>>> >> 1. We can use int32 for statistic keys.
>>> >> 2. We can use the special statistic key + a string identifier
>>> >>    for non-standard statistic keys.
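>>> >>
>>> >>    For example (the values are hypothetical), a vendor-specific
>>> >>    statistic would be stored as the special key (key=0 in
>>> >>    Felipe's earlier message) plus a string identifier such as
>>> >>    "vendor1:compressed_size", while standard statistics use only
>>> >>    their assigned int32 keys.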
>>> >>
>>> >>
>>> >> What do you think about this?
>>> >>
>>> >>
>>> >> Thanks,
>>> >> --
>>> >> kou
>>> >>
>>> >> In <20240606.182727.1004633558059795207....@clear-code.com>
>>> >>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun
>>> >> 2024 18:27:27 +0900 (JST),
>>> >>   Sutou Kouhei <k...@clear-code.com> wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > Thanks for sharing your comments. Here is a summary so far:
>>> >> >
>>> >> > ----
>>> >> >
>>> >> > Use cases:
>>> >> >
>>> >> > * Optimize query plan: e.g. JOIN for DuckDB
>>> >> >
>>> >> > Out of scope:
>>> >> >
>>> >> > * Transmitting statistics through anything other than the C data
>>> >> >   interface
>>> >> >   Examples:
>>> >> >   * Transmit statistics through Apache Arrow IPC file
>>> >> >   * Transmit statistics through Apache Arrow Flight
>>> >> >
>>> >> > Candidate approaches:
>>> >> >
>>> >> > 1. Pass statistics (encoded as Apache Arrow data) via
>>> >> >    ArrowSchema metadata
>>> >> >    * This embeds the statistics address into metadata
>>> >> >    * It avoids using the Apache Arrow IPC format with
>>> >> >      the C data interface
>>> >> > 2. Embed statistics (encoded as Apache Arrow data) into
>>> >> >    ArrowSchema metadata
>>> >> >    * This adds statistics to metadata in Apache Arrow IPC
>>> >> >      format
>>> >> > 3. Embed statistics (encoded as JSON) into ArrowArray
>>> >> >    metadata
>>> >> > 4. Standardize an Apache Arrow schema for statistics and
>>> >> >    transmit statistics via a separate API call that uses
>>> >> >    the C data interface
>>> >> > 5. Use ADBC
>>> >> >
>>> >> > ----
>>> >> >
>>> >> > I think that 4. is the best approach in these candidates.
>>> >> >
>>> >> > 1. Embedding the statistics address is tricky.
>>> >> > 2. Consumers need to parse Apache Arrow IPC format data.
>>> >> >    (C data interface consumers may not have that
>>> >> >    feature.)
>>> >> > 3. This will work but 4. is more generic.
>>> >> > 5. ADBC is too large to use only for statistics.
>>> >> >
>>> >> > What do you think about this?
>>> >> >
>>> >> >
>>> >> > If we select 4., we need to standardize an Apache Arrow schema
>>> >> > for statistics. How about the following schema?
>>> >> >
>>> >> > ----
>>> >> > Metadata:
>>> >> >
>>> >> > | Name                       | Value | Comments |
>>> >> > |----------------------------|-------|--------- |
>>> >> > | ARROW::statistics::version | 1.0.0 | (1)      |
>>> >> >
>>> >> > (1) This follows semantic versioning.
>>> >> >
>>> >> > Fields:
>>> >> >
>>> >> > | Name           | Type                  | Comments |
>>> >> > |----------------|-----------------------| -------- |
>>> >> > | column         | utf8                  | (2)      |
>>> >> > | key            | utf8 not null         | (3)      |
>>> >> > | value          | VALUE_SCHEMA not null |          |
>>> >> > | is_approximate | bool not null         | (4)      |
>>> >> >
>>> >> > (2) If null, then the statistic applies to the entire table.
>>> >> >     It's for "row_count".
>>> >> > (3) We'll provide pre-defined keys such as "max", "min",
>>> >> >     "byte_width" and "distinct_count" but users can also use
>>> >> >     application specific keys.
>>> >> > (4) If true, then the value is approximate or best-effort.
>>> >> >
>>> >> > VALUE_SCHEMA is a dense union with members:
>>> >> >
>>> >> > | Name    | Type    |
>>> >> > |---------|---------|
>>> >> > | int64   | int64   |
>>> >> > | uint64  | uint64  |
>>> >> > | float64 | float64 |
>>> >> > | binary  | binary  |
>>> >> >
>>> >> > If a column is an int32 column, it uses int64 for
>>> >> > "max"/"min". We don't provide all types here. Users should
>>> >> > use a compatible type (int64 for an int32 column) instead.
>>> >> > ----
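>>> >> >
>>> >> > As an illustration of the compatible-type rule above, here is a
>>> >> > minimal Apache Arrow C++ sketch (illustrative only, not part of
>>> >> > the proposal) that appends a "max" value for an int32 column to
>>> >> > the int64 member of the VALUE_SCHEMA dense union:
>>> >> >
>>> >> > #include <arrow/api.h>
>>> >> > #include <memory>
>>> >> >
>>> >> > arrow::Status AppendInt32ColumnMax(int32_t max) {
>>> >> >   arrow::DenseUnionBuilder value_builder(arrow::default_memory_pool());
>>> >> >   auto int64_builder = std::make_shared<arrow::Int64Builder>();
>>> >> >   const int8_t int64_code =
>>> >> >       value_builder.AppendChild(int64_builder, "int64");
>>> >> >   // "max" of an int32 column is widened to int64 instead of
>>> >> >   // adding an int32 member to VALUE_SCHEMA.
>>> >> >   ARROW_RETURN_NOT_OK(value_builder.Append(int64_code));
>>> >> >   ARROW_RETURN_NOT_OK(
>>> >> >       int64_builder->Append(static_cast<int64_t>(max)));
>>> >> >   std::shared_ptr<arrow::Array> values;
>>> >> >   return value_builder.Finish(&values);
>>> >> > }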
>>> >> >
>>> >> >
>>> >> > Thanks,
>>> >> > --
>>> >> > kou
>>> >> >
>>> >> >
>>> >> > In <20240522.113708.2023905028549001143....@clear-code.com>
>>> >> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May
>>> >> 2024 11:37:08 +0900 (JST),
>>> >> >   Sutou Kouhei <k...@clear-code.com> wrote:
>>> >> >
>>> >> >> Hi,
>>> >> >>
>>> >> >> We're discussing how to provide statistics through the C
>>> >> >> data interface at:
>>> >> >> https://github.com/apache/arrow/issues/38837
>>> >> >>
>>> >> >> If you're interested in this feature, could you share your
>>> >> >> comments?
>>> >> >>
>>> >> >>
>>> >> >> Motivation:
>>> >> >>
>>> >> >> We can interchange Apache Arrow data via the C data interface
>>> >> >> in the same process. For example, we can pass Apache Arrow
>>> >> >> data read by Apache Arrow C++ (provider) to DuckDB
>>> >> >> (consumer) through the C data interface.
>>> >> >>
>>> >> >> A provider may know statistics for its Apache Arrow data. For
>>> >> >> example, a provider can know statistics when it reads Apache
>>> >> >> Parquet data, because Apache Parquet may provide statistics.
>>> >> >>
>>> >> >> But a consumer can't know statistics that are known by a
>>> >> >> provider, because there isn't a standard way to provide
>>> >> >> statistics through the C data interface. If a consumer can
>>> >> >> know statistics, it can process Apache Arrow data faster
>>> >> >> based on them.
>>> >> >>
>>> >> >>
>>> >> >> Proposal:
>>> >> >>
>>> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>> >> >>
>>> >> >> How about providing statistics as metadata in ArrowSchema?
>>> >> >>
>>> >> >> We reserve "ARROW" namespace for internal Apache Arrow use:
>>> >> >>
>>> >> >>
>>> >> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>> >> >>
>>> >> >>> The ARROW pattern is a reserved namespace for internal
>>> >> >>> Arrow use in the custom_metadata fields. For example,
>>> >> >>> ARROW:extension:name.
>>> >> >>
>>> >> >> So we can use "ARROW:statistics" for the metadata key.
>>> >> >>
>>> >> >> We can represent statistics as an ArrowArray like ADBC does.
>>> >> >>
>>> >> >> Here is an example ArrowSchema that is for a record batch
>>> >> >> that has "int32 column1" and "string column2":
>>> >> >>
>>> >> >> ArrowSchema {
>>> >> >>   .format = "+siu",
>>> >> >>   .metadata = {
>>> >> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such
>>> >> as row count */
>>> >> >>   },
>>> >> >>   .children = {
>>> >> >>     ArrowSchema {
>>> >> >>       .name = "column1",
>>> >> >>       .format = "i",
>>> >> >>       .metadata = {
>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
>>> >> such as count distinct */
>>> >> >>       },
>>> >> >>     },
>>> >> >>     ArrowSchema {
>>> >> >>       .name = "column2",
>>> >> >>       .format = "u",
>>> >> >>       .metadata = {
>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics
>>> >> such as count distinct */
>>> >> >>       },
>>> >> >>     },
>>> >> >>   },
>>> >> >> }
>>> >> >>
>>> >> >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>>> >> >> => ArrowArray*' is a base 10 string of the address of the
>>> >> >> ArrowArray, because we can only use strings for metadata
>>> >> >> values. You can't release the statistics ArrowArray*. (Its
>>> >> >> release is a no-op function.) It follows
>>> >> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>> >> >> semantics. (The base ArrowSchema owns the statistics
>>> >> >> ArrowArray*.)
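>>> >> >>
>>> >> >> For example, a producer could build the metadata value like
>>> >> >> this (a C++ sketch; the function name is hypothetical):
>>> >> >>
>>> >> >> #include <cstdint>
>>> >> >> #include <string>
>>> >> >>
>>> >> >> struct ArrowArray;  // from the C data interface (arrow/c/abi.h)
>>> >> >>
>>> >> >> std::string StatisticsMetadataValue(const ArrowArray* statistics) {
>>> >> >>   // Metadata values must be strings, so the ArrowArray address
>>> >> >>   // is encoded as a base 10 string.
>>> >> >>   return std::to_string(
>>> >> >>       reinterpret_cast<std::uintptr_t>(statistics));
>>> >> >> }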
>>> >> >>
>>> >> >>
>>> >> >> The ArrowArray* for statistics uses the following schema:
>>> >> >>
>>> >> >> | Field Name     | Field Type                       | Comments |
>>> >> >> |----------------|----------------------------------| -------- |
>>> >> >> | key            | string not null                  | (1)      |
>>> >> >> | value          | `VALUE_SCHEMA` not null          |          |
>>> >> >> | is_approximate | bool not null                    | (2)      |
>>> >> >>
>>> >> >> 1. We'll provide pre-defined keys such as "max", "min",
>>> >> >>    "byte_width" and "distinct_count" but users can also use
>>> >> >>    application specific keys.
>>> >> >>
>>> >> >> 2. If true, then the value is approximate or best-effort.
>>> >> >>
>>> >> >> VALUE_SCHEMA is a dense union with members:
>>> >> >>
>>> >> >> | Field Name | Field Type                       | Comments |
>>> >> >> |------------|----------------------------------| -------- |
>>> >> >> | int64      | int64                            |          |
>>> >> >> | uint64     | uint64                           |          |
>>> >> >> | float64    | float64                          |          |
>>> >> >> | value      | The same type as the ArrowSchema | (3)      |
>>> >> >> |            | it belongs to.                   |          |
>>> >> >>
>>> >> >> 3. If the ArrowSchema's type is string, this type is also string.
>>> >> >>
>>> >> >>    TODO: Is "value" a good name? If we refer to it from the
>>> >> >>    top-level statistics schema, we need to use
>>> >> >>    "value.value". It's a bit strange...
>>> >> >>
>>> >> >>
>>> >> >> What do you think about this proposal? Could you share your
>>> >> >> comments?
>>> >> >>
>>> >> >>
>>> >> >> Thanks,
>>> >> >> --
>>> >> >> kou
>>> >>
>>>
