Hi,

I've opened a PR for documentation:
https://github.com/apache/arrow/pull/43553
The "Example" section isn't written yet but suggestions are very
welcome.

Thanks,
--
kou

In <20240725.143518.421507820763165665....@clear-code.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 25 Jul 2024 14:35:18 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> If nobody objects to using utf8 or dictionary<int32, utf8> for
> statistics keys, let's use dictionary<int32, utf8>, because
> dictionary<int32, utf8> will be more efficient than utf8
> when there are many columns.
>
> I'll start writing documentation for this and implementing
> it for C++ next week. I'll share a PR for them when I
> complete them. We can start a vote for this after we review
> the PR.
>
>
> Thanks,
> --
> kou
>
> In <20240712.151536.312169170508271330....@clear-code.com>
>   "Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul 2024 15:15:36 +0900 (JST),
>   Sutou Kouhei <k...@clear-code.com> wrote:
>
>> Hi,
>>
>>>> map<struct<int32, utf8>,
>>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>>> Yes. That's my suggestion. And to leverage the fact that libraries
>>> handle unions gracefully, this could be:
>>>
>>> map<X_union<int32, utf8>,
>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>>> X is either sparse or dense.
>>>
>>> A possible alternative is to use a custom struct instead of map and
>>> reduce the levels of nesting:
>>>
>>> struct<int32, utf8, dense_union<...needed types based on the keys...>>
>>
>> Thanks for clarifying your suggestion.
>>
>> If we need utf8 for non-standard statistics, I think that
>> map<utf8, ...> or map<dictionary<int32, utf8>, ...> is
>> better, as Antoine said, because they are simpler than
>> int32+utf8.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <caoc8yxy_e9w5sky572phktt-tdguc9v3xpxiom1xohp8rq7...@mail.gmail.com>
>>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul 2024 14:17:46 -0300,
>>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>>
>>> On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> > for non-standard statistics from open-source products the key=0
>>>> > combined with string label is the way to go
>>>>
>>>> Where do we store the string label?
>>>>
>>>> I think that we're considering the following schema:
>>>>
>>>> >> map<
>>>> >>   // The column index or null if the statistics refer to the
>>>> >>   // whole table or batch.
>>>> >>   column: int32,
>>>> >>   // Statistics key is int32.
>>>> >>   // Different keys are assigned for exact value and
>>>> >>   // approximate value.
>>>> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>>> >> >
>>>>
>>>> Are you considering the following schema for the key=0 case?
>>>>
>>>> map<struct<int32, utf8>,
>>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>>> Yes. That's my suggestion. And to leverage the fact that libraries
>>> handle unions gracefully, this could be:
>>>
>>> map<X_union<int32, utf8>,
>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>>> X is either sparse or dense.
>>>
>>> A possible alternative is to use a custom struct instead of map and
>>> reduce the levels of nesting:
>>>
>>> struct<int32, utf8, dense_union<...needed types based on the keys...>>
>>>
>>> --
>>> Felipe
>>>
>>>
>>>> Thanks,
>>>> --
>>>> kou
>>>>
>>>> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com>
>>>>   "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul 2024 11:58:44 -0300,
>>>>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > You can promise that well-known int32 statistic keys won't ever be
>>>> > higher than a certain value (2^18) [1], like TCP/IP ports (well-known
>>>> > ports are in [0, 2^10)), but for non-standard statistics from
>>>> > open-source products the key=0 combined with a string label is the
>>>> > way to go; otherwise collisions would be inevitable and everyone
>>>> > would hate us for having integer keys.
>>>> >
>>>> > This is not a very weird proposal on my part because integer keys
>>>> > representing labels are common in most low-level standardized C APIs
>>>> > (e.g. Linux syscalls, ioctls, OpenGL, Vulkan...). I expect higher
>>>> > level APIs to map all these keys to strings, but with them we keep
>>>> > the "C Data Interface" low-level and portable as it should be.
>>>> >
>>>> > --
>>>> > Felipe
>>>> >
>>>> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums
>>>> > because C limits enums to 2^16 and that is *not enough*.
>>>> >
>>>> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>>> >
>>>> >> Hi,
>>>> >>
>>>> >> Here is an updated summary so far:
>>>> >>
>>>> >> ----
>>>> >> Use cases:
>>>> >>
>>>> >> * Optimize query plans: e.g. JOIN for DuckDB
>>>> >>
>>>> >> Out of scope:
>>>> >>
>>>> >> * Transmitting statistics through something other than the C data
>>>> >>   interface. Examples:
>>>> >>   * Transmit statistics through an Apache Arrow IPC file
>>>> >>   * Transmit statistics through Apache Arrow Flight
>>>> >> * Multi-column statistics
>>>> >> * Constraints information
>>>> >> * Indexes information
>>>> >>
>>>> >> Approach under discussion:
>>>> >>
>>>> >> Standardize an Apache Arrow schema for statistics and transmit
>>>> >> statistics via a separate API call that uses the C data
>>>> >> interface.
>>>> >>
>>>> >> This also works for per-batch statistics.
>>>> >>
>>>> >> Candidate schema:
>>>> >>
>>>> >> map<
>>>> >>   // The column index or null if the statistics refer to the whole
>>>> >>   // table or batch.
>>>> >>   column: int32,
>>>> >>   // Statistics key is int32.
>>>> >>   // Different keys are assigned for exact value and
>>>> >>   // approximate value.
>>>> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>>> >> >
>>>> >>
>>>> >> Discussions:
>>>> >>
>>>> >> 1. Can we use int32 for statistic keys?
>>>> >>    Should we use utf8 (or dictionary<int32, utf8>) for
>>>> >>    statistic keys?
>>>> >> 2. How to support non-standard (vendor-specific)
>>>> >>    statistic keys?
>>>> >> ----
>>>> >>
>>>> >> Here is my idea:
>>>> >>
>>>> >> 1. We can use int32 for statistic keys.
>>>> >> 2. We can reserve a specific range for non-standard
>>>> >>    statistic keys. Prerequisites of this:
>>>> >>    * There is no use case that merges statistics for
>>>> >>      the same data.
>>>> >>    * We can't merge statistics for different data.
>>>> >>
>>>> >> If the prerequisites aren't satisfied:
>>>> >>
>>>> >> 1. We should use utf8 (or dictionary<int32, utf8>) for
>>>> >>    statistic keys.
>>>> >> 2. We can use a reserved prefix such as "ARROW:"/"arrow." for
>>>> >>    standard statistic keys and a prefix such as
>>>> >>    "vendor1:"/"vendor1." for non-standard statistic keys.
>>>> >>
>>>> >> Here is Felipe's idea:
>>>> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
>>>> >>
>>>> >> 1. We can use int32 for statistic keys.
>>>> >> 2. We can use a special statistic key + a string identifier
>>>> >>    for non-standard statistic keys.
>>>> >>
>>>> >>
>>>> >> What do you think about this?
>>>> >>
>>>> >>
>>>> >> Thanks,
>>>> >> --
>>>> >> kou
>>>> >>
>>>> >> In <20240606.182727.1004633558059795207....@clear-code.com>
>>>> >>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 18:27:27 +0900 (JST),
>>>> >>   Sutou Kouhei <k...@clear-code.com> wrote:
>>>> >>
>>>> >> > Hi,
>>>> >> >
>>>> >> > Thanks for sharing your comments. Here is a summary so far:
>>>> >> >
>>>> >> > ----
>>>> >> >
>>>> >> > Use cases:
>>>> >> >
>>>> >> > * Optimize query plans: e.g. JOIN for DuckDB
>>>> >> >
>>>> >> > Out of scope:
>>>> >> >
>>>> >> > * Transmitting statistics through something other than the C data
>>>> >> >   interface. Examples:
>>>> >> >   * Transmit statistics through an Apache Arrow IPC file
>>>> >> >   * Transmit statistics through Apache Arrow Flight
>>>> >> >
>>>> >> > Candidate approaches:
>>>> >> >
>>>> >> > 1. Pass statistics (encoded as Apache Arrow data) via
>>>> >> >    ArrowSchema metadata
>>>> >> >    * This embeds the statistics address into metadata
>>>> >> >    * It's for avoiding using the Apache Arrow IPC format with
>>>> >> >      the C data interface
>>>> >> > 2. Embed statistics (encoded as Apache Arrow data) into
>>>> >> >    ArrowSchema metadata
>>>> >> >    * This adds statistics to metadata in the Apache Arrow IPC
>>>> >> >      format
>>>> >> > 3. Embed statistics (encoded as JSON) into ArrowArray
>>>> >> >    metadata
>>>> >> > 4. Standardize an Apache Arrow schema for statistics and
>>>> >> >    transmit statistics via a separate API call that uses the
>>>> >> >    C data interface
>>>> >> > 5. Use ADBC
>>>> >> >
>>>> >> > ----
>>>> >> >
>>>> >> > I think that 4. is the best of these candidates:
>>>> >> >
>>>> >> > 1. Embedding the statistics address is tricky.
>>>> >> > 2. Consumers need to parse Apache Arrow IPC format data.
>>>> >> >    (C data interface consumers may not have that
>>>> >> >    feature.)
>>>> >> > 3. This would work, but 4. is more generic.
>>>> >> > 5. ADBC is too large to use only for statistics.
>>>> >> >
>>>> >> > What do you think about this?
>>>> >> >
>>>> >> >
>>>> >> > If we select 4., we need to standardize an Apache Arrow schema
>>>> >> > for statistics. How about the following schema?
>>>> >> >
>>>> >> > ----
>>>> >> > Metadata:
>>>> >> >
>>>> >> > | Name                       | Value | Comments |
>>>> >> > |----------------------------|-------|----------|
>>>> >> > | ARROW::statistics::version | 1.0.0 | (1)      |
>>>> >> >
>>>> >> > (1) This follows semantic versioning.
>>>> >> >
>>>> >> > Fields:
>>>> >> >
>>>> >> > | Name           | Type                  | Comments |
>>>> >> > |----------------|-----------------------|----------|
>>>> >> > | column         | utf8                  | (2)      |
>>>> >> > | key            | utf8 not null         | (3)      |
>>>> >> > | value          | VALUE_SCHEMA not null |          |
>>>> >> > | is_approximate | bool not null         | (4)      |
>>>> >> >
>>>> >> > (2) If null, then the statistic applies to the entire table.
>>>> >> >     It's for "row_count".
>>>> >> > (3) We'll provide pre-defined keys such as "max", "min",
>>>> >> >     "byte_width" and "distinct_count" but users can also use
>>>> >> >     application-specific keys.
>>>> >> > (4) If true, then the value is approximate or best-effort.
>>>> >> >
>>>> >> > VALUE_SCHEMA is a dense union with members:
>>>> >> >
>>>> >> > | Name    | Type    |
>>>> >> > |---------|---------|
>>>> >> > | int64   | int64   |
>>>> >> > | uint64  | uint64  |
>>>> >> > | float64 | float64 |
>>>> >> > | binary  | binary  |
>>>> >> >
>>>> >> > If a column is an int32 column, it uses int64 for
>>>> >> > "max"/"min". We don't provide all types here. Users should
>>>> >> > use a compatible type (int64 for an int32 column) instead.
>>>> >> > ----
>>>> >> >
>>>> >> >
>>>> >> > Thanks,
>>>> >> > --
>>>> >> > kou
>>>> >> >
>>>> >> >
>>>> >> > In <20240522.113708.2023905028549001143....@clear-code.com>
>>>> >> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:37:08 +0900 (JST),
>>>> >> >   Sutou Kouhei <k...@clear-code.com> wrote:
>>>> >> >
>>>> >> >> Hi,
>>>> >> >>
>>>> >> >> We're discussing how to provide statistics through the C
>>>> >> >> data interface at:
>>>> >> >> https://github.com/apache/arrow/issues/38837
>>>> >> >>
>>>> >> >> If you're interested in this feature, could you share your
>>>> >> >> comments?
>>>> >> >>
>>>> >> >>
>>>> >> >> Motivation:
>>>> >> >>
>>>> >> >> We can interchange Apache Arrow data via the C data interface
>>>> >> >> in the same process. For example, we can pass Apache Arrow
>>>> >> >> data read by Apache Arrow C++ (provider) to DuckDB
>>>> >> >> (consumer) through the C data interface.
>>>> >> >>
>>>> >> >> A provider may know Apache Arrow data statistics. For
>>>> >> >> example, a provider can know statistics when it reads Apache
>>>> >> >> Parquet data because Apache Parquet may provide statistics.
>>>> >> >>
>>>> >> >> But a consumer can't know statistics that are known by a
>>>> >> >> producer, because there isn't a standard way to provide
>>>> >> >> statistics through the C data interface. If a consumer can
>>>> >> >> know statistics, it can process Apache Arrow data faster
>>>> >> >> based on statistics.
>>>> >> >>
>>>> >> >>
>>>> >> >> Proposal:
>>>> >> >>
>>>> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>>> >> >>
>>>> >> >> How about providing statistics as metadata in ArrowSchema?
>>>> >> >>
>>>> >> >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>>>> >> >>
>>>> >> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>>> >> >>
>>>> >> >>> The ARROW pattern is a reserved namespace for internal
>>>> >> >>> Arrow use in the custom_metadata fields. For example,
>>>> >> >>> ARROW:extension:name.
>>>> >> >>
>>>> >> >> So we can use "ARROW:statistics" for the metadata key.
>>>> >> >>
>>>> >> >> We can represent statistics as an ArrowArray like ADBC does.
>>>> >> >>
>>>> >> >> Here is an example ArrowSchema that is for a record batch
>>>> >> >> that has "int32 column1" and "string column2":
>>>> >> >>
>>>> >> >> ArrowSchema {
>>>> >> >>   .format = "+s",
>>>> >> >>   .metadata = {
>>>> >> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>>>> >> >>   },
>>>> >> >>   .children = {
>>>> >> >>     ArrowSchema {
>>>> >> >>       .name = "column1",
>>>> >> >>       .format = "i",
>>>> >> >>       .metadata = {
>>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>> >> >>       },
>>>> >> >>     },
>>>> >> >>     ArrowSchema {
>>>> >> >>       .name = "column2",
>>>> >> >>       .format = "u",
>>>> >> >>       .metadata = {
>>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>>> >> >>       },
>>>> >> >>     },
>>>> >> >>   },
>>>> >> >> }
>>>> >> >>
>>>> >> >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>>>> >> >> => ArrowArray*' is a base-10 string of the address of the
>>>> >> >> ArrowArray, because we can use only strings for metadata
>>>> >> >> values. You can't release the statistics ArrowArray*. (Its
>>>> >> >> release is a no-op function.) It follows
>>>> >> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>>> >> >> semantics. (The base ArrowSchema owns the statistics
>>>> >> >> ArrowArray*.)
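[Editor's note: a minimal sketch of the address-as-string encoding described above, using a plain ctypes buffer as a hypothetical stand-in for an exported ArrowArray; only the round trip of the address through a base-10 string is illustrated.]

```python
import ctypes

# Stand-in for an exported ArrowArray; only its address matters here.
stats_array = ctypes.create_string_buffer(b"statistics placeholder")
address = ctypes.addressof(stats_array)

# Producer side: metadata values must be strings, so encode the
# pointer as a base-10 string.
metadata_value = str(address)

# Consumer side: parse the string back into a pointer value.
recovered_address = int(metadata_value, 10)

assert recovered_address == address
```

The consumer would then cast the integer back to a pointer without taking ownership, matching the no-op release semantics described above.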
>>>> >> >>
>>>> >> >>
>>>> >> >> The ArrowArray* for statistics uses the following schema:
>>>> >> >>
>>>> >> >> | Field Name     | Field Type              | Comments |
>>>> >> >> |----------------|-------------------------|----------|
>>>> >> >> | key            | string not null         | (1)      |
>>>> >> >> | value          | `VALUE_SCHEMA` not null |          |
>>>> >> >> | is_approximate | bool not null           | (2)      |
>>>> >> >>
>>>> >> >> 1. We'll provide pre-defined keys such as "max", "min",
>>>> >> >>    "byte_width" and "distinct_count" but users can also use
>>>> >> >>    application-specific keys.
>>>> >> >>
>>>> >> >> 2. If true, then the value is approximate or best-effort.
>>>> >> >>
>>>> >> >> VALUE_SCHEMA is a dense union with members:
>>>> >> >>
>>>> >> >> | Field Name | Field Type                       | Comments |
>>>> >> >> |------------|----------------------------------|----------|
>>>> >> >> | int64      | int64                            |          |
>>>> >> >> | uint64     | uint64                           |          |
>>>> >> >> | float64    | float64                          |          |
>>>> >> >> | value      | The same type as the ArrowSchema | (3)      |
>>>> >> >> |            | it belongs to.                   |          |
>>>> >> >>
>>>> >> >> 3. If the ArrowSchema's type is string, this type is also string.
>>>> >> >>
>>>> >> >> TODO: Is "value" a good name? If we refer to it from the
>>>> >> >> top-level statistics schema, we need to use
>>>> >> >> "value.value". That's a bit strange...
>>>> >> >>
>>>> >> >>
>>>> >> >> What do you think about this proposal? Could you share your
>>>> >> >> comments?
>>>> >> >>
>>>> >> >>
>>>> >> >> Thanks,
>>>> >> >> --
>>>> >> >> kou