Hi,

If nobody objects to using utf8 or dictionary<int32, utf8> for
statistics keys, let's use dictionary<int32, utf8>, because
dictionary<int32, utf8> will be more efficient than utf8 when there
are many columns.
I'll start writing documentation for this and implementing it for C++
next week. I'll share a PR when they are complete. We can start a vote
for this after we review the PR.

Thanks,
--
kou

In <20240712.151536.312169170508271330....@clear-code.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul 2024 15:15:36 +0900 (JST),
  Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
>>> map<struct<int32, utf8>,
>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>>
>> Yes. That's my suggestion. And to leverage the fact that libraries
>> handle unions gracefully, this could be:
>>
>> map<X_union<int32, utf8>, dense_union<...needed types based on stat
>> kinds in the keys...>>
>>
>> X is either sparse or dense.
>>
>> A possible alternative is to use a custom struct instead of map and
>> reduce the levels of nesting:
>>
>> struct<int32, utf8, dense_union<...needed types based on the keys...>>
>
> Thanks for clarifying your suggestion.
>
> If we need utf8 for non-standard statistics, I think that
> map<utf8, ...> or map<dictionary<int32, utf8>, ...> is
> better, as Antoine said, because they are simpler than
> int32+utf8.
>
>
> Thanks,
> --
> kou
>
> In <caoc8yxy_e9w5sky572phktt-tdguc9v3xpxiom1xohp8rq7...@mail.gmail.com>
>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul 2024 14:17:46 -0300,
>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>
>> On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>
>>> Hi,
>>>
>>> > for non-standard statistics from open-source products the key=0
>>> > combined with string label is the way to go
>>>
>>> Where do we store the string label?
>>>
>>> I think that we're considering the following schema:
>>>
>>> >> map<
>>> >>   // The column index or null if the statistics refer to whole table or batch.
>>> >>   column: int32,
>>> >>   // Statistics key is int32.
>>> >>   // Different keys are assigned for exact value and
>>> >>   // approximate value.
>>> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>> >> >
>>>
>>> Are you considering the following schema for the key=0 case?
>>>
>>> map<struct<int32, utf8>,
>>>     dense_union<...needed types based on stat kinds in the keys...>>
>>>
>>
>> Yes. That's my suggestion. And to leverage the fact that libraries
>> handle unions gracefully, this could be:
>>
>> map<X_union<int32, utf8>, dense_union<...needed types based on stat
>> kinds in the keys...>>
>>
>> X is either sparse or dense.
>>
>> A possible alternative is to use a custom struct instead of map and
>> reduce the levels of nesting:
>>
>> struct<int32, utf8, dense_union<...needed types based on the keys...>>
>>
>> --
>> Felipe
>>
>>
>>>
>>> Thanks,
>>> --
>>> kou
>>>
>>> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com>
>>>   "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul 2024 11:58:44 -0300,
>>>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > You can promise that well-known int32 statistic keys won't ever be
>>> > higher than a certain value (2^18) [1], like TCP/IP ports (well-known
>>> > ports in [0, 2^10)), but for non-standard statistics from open-source
>>> > products the key=0 combined with a string label is the way to go;
>>> > otherwise collisions would be inevitable and everyone would hate us
>>> > for having integer keys.
>>> >
>>> > This is not a very weird proposal on my part, because integer keys
>>> > representing labels are common in most low-level standardized C APIs
>>> > (e.g. Linux syscalls, ioctls, OpenGL, Vulkan...). I expect higher-level
>>> > APIs to map all these keys to strings, but with them we keep the
>>> > "C Data Interface" low-level and portable as it should be.
>>> >
>>> > --
>>> > Felipe
>>> >
>>> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums
>>> > because C limits enum to 2^16 and that is *not enough*.
>>> >
>>> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> Here is an updated summary so far:
>>> >>
>>> >> ----
>>> >> Use cases:
>>> >>
>>> >> * Optimize query plan: e.g. JOIN for DuckDB
>>> >>
>>> >> Out of scope:
>>> >>
>>> >> * Transmit statistics through anything other than the C data interface
>>> >>   Examples:
>>> >>   * Transmit statistics through Apache Arrow IPC file
>>> >>   * Transmit statistics through Apache Arrow Flight
>>> >> * Multi-column statistics
>>> >> * Constraints information
>>> >> * Indexes information
>>> >>
>>> >> Approach under discussion:
>>> >>
>>> >> Standardize an Apache Arrow schema for statistics and transmit
>>> >> statistics via a separate API call that uses the C data
>>> >> interface.
>>> >>
>>> >> This also works for per-batch statistics.
>>> >>
>>> >> Candidate schema:
>>> >>
>>> >> map<
>>> >>   // The column index or null if the statistics refer to whole table or batch.
>>> >>   column: int32,
>>> >>   // Statistics key is int32.
>>> >>   // Different keys are assigned for exact value and
>>> >>   // approximate value.
>>> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>>> >> >
>>> >>
>>> >> Discussions:
>>> >>
>>> >> 1. Can we use int32 for statistic keys?
>>> >>    Should we use utf8 (or dictionary<int32, utf8>) for
>>> >>    statistic keys?
>>> >> 2. How to support non-standard (vendor-specific)
>>> >>    statistic keys?
>>> >> ----
>>> >>
>>> >> Here is my idea:
>>> >>
>>> >> 1. We can use int32 for statistic keys.
>>> >> 2. We can reserve a specific range for non-standard
>>> >>    statistic keys. Prerequisites of this:
>>> >>    * There is no use case to merge some statistics for
>>> >>      the same data.
>>> >>    * We can't merge statistics for different data.
>>> >>
>>> >> If the prerequisites aren't satisfied:
>>> >>
>>> >> 1. We should use utf8 (or dictionary<int32, utf8>) for
>>> >>    statistic keys.
>>> >> 2. We can use a reserved prefix such as "ARROW:"/"arrow." for
>>> >>    standard statistic keys, or a prefix such as
>>> >>    "vendor1:"/"vendor1." for non-standard statistic keys.
>>> >>
>>> >> Here is Felipe's idea:
>>> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
>>> >>
>>> >> 1. We can use int32 for statistic keys.
>>> >> 2. We can use the special statistic key + a string identifier
>>> >>    for non-standard statistic keys.
>>> >>
>>> >>
>>> >> What do you think about this?
>>> >>
>>> >>
>>> >> Thanks,
>>> >> --
>>> >> kou
>>> >>
>>> >> In <20240606.182727.1004633558059795207....@clear-code.com>
>>> >>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 18:27:27 +0900 (JST),
>>> >>   Sutou Kouhei <k...@clear-code.com> wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > Thanks for sharing your comments. Here is a summary so far:
>>> >> >
>>> >> > ----
>>> >> >
>>> >> > Use cases:
>>> >> >
>>> >> > * Optimize query plan: e.g. JOIN for DuckDB
>>> >> >
>>> >> > Out of scope:
>>> >> >
>>> >> > * Transmit statistics through anything other than the C data interface
>>> >> >   Examples:
>>> >> >   * Transmit statistics through Apache Arrow IPC file
>>> >> >   * Transmit statistics through Apache Arrow Flight
>>> >> >
>>> >> > Candidate approaches:
>>> >> >
>>> >> > 1. Pass statistics (encoded as Apache Arrow data) via
>>> >> >    ArrowSchema metadata
>>> >> >    * This embeds the statistics address into metadata
>>> >> >    * It's for avoiding using Apache Arrow IPC format with
>>> >> >      the C data interface
>>> >> > 2. Embed statistics (encoded as Apache Arrow data) into
>>> >> >    ArrowSchema metadata
>>> >> >    * This adds statistics to metadata in Apache Arrow IPC
>>> >> >      format
>>> >> > 3. Embed statistics (encoded as JSON) into ArrowArray
>>> >> >    metadata
>>> >> > 4. Standardize an Apache Arrow schema for statistics and
>>> >> >    transmit statistics via a separate API call that uses the
>>> >> >    C data interface
>>> >> > 5. Use ADBC
>>> >> >
>>> >> > ----
>>> >> >
>>> >> > I think that 4. is the best among these candidates:
>>> >> >
>>> >> > 1. Embedding the statistics address is tricky.
>>> >> > 2. Consumers need to parse Apache Arrow IPC format data.
>>> >> >    (The C data interface consumers may not have the
>>> >> >    feature.)
>>> >> > 3. This will work, but 4. is more generic.
>>> >> > 5. ADBC is too large to use only for statistics.
>>> >> >
>>> >> > What do you think about this?
>>> >> >
>>> >> >
>>> >> > If we select 4., we need to standardize an Apache Arrow schema
>>> >> > for statistics. How about the following schema?
>>> >> >
>>> >> > ----
>>> >> > Metadata:
>>> >> >
>>> >> > | Name                       | Value | Comments |
>>> >> > |----------------------------|-------|----------|
>>> >> > | ARROW::statistics::version | 1.0.0 | (1)      |
>>> >> >
>>> >> > (1) This follows semantic versioning.
>>> >> >
>>> >> > Fields:
>>> >> >
>>> >> > | Name           | Type                  | Comments |
>>> >> > |----------------|-----------------------|----------|
>>> >> > | column         | utf8                  | (2)      |
>>> >> > | key            | utf8 not null         | (3)      |
>>> >> > | value          | VALUE_SCHEMA not null |          |
>>> >> > | is_approximate | bool not null         | (4)      |
>>> >> >
>>> >> > (2) If null, then the statistic applies to the entire table.
>>> >> >     It's for "row_count".
>>> >> > (3) We'll provide pre-defined keys such as "max", "min",
>>> >> >     "byte_width" and "distinct_count", but users can also use
>>> >> >     application-specific keys.
>>> >> > (4) If true, then the value is approximate or best-effort.
>>> >> >
>>> >> > VALUE_SCHEMA is a dense union with members:
>>> >> >
>>> >> > | Name    | Type    |
>>> >> > |---------|---------|
>>> >> > | int64   | int64   |
>>> >> > | uint64  | uint64  |
>>> >> > | float64 | float64 |
>>> >> > | binary  | binary  |
>>> >> >
>>> >> > If a column is an int32 column, it uses int64 for
>>> >> > "max"/"min". We don't provide all types here. Users should
>>> >> > use a compatible type (int64 for an int32 column) instead.
>>> >> > ----
>>> >> >
>>> >> >
>>> >> > Thanks,
>>> >> > --
>>> >> > kou
>>> >> >
>>> >> >
>>> >> > In <20240522.113708.2023905028549001143....@clear-code.com>
>>> >> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:37:08 +0900 (JST),
>>> >> >   Sutou Kouhei <k...@clear-code.com> wrote:
>>> >> >
>>> >> >> Hi,
>>> >> >>
>>> >> >> We're discussing how to provide statistics through the C
>>> >> >> data interface at:
>>> >> >> https://github.com/apache/arrow/issues/38837
>>> >> >>
>>> >> >> If you're interested in this feature, could you share your
>>> >> >> comments?
>>> >> >>
>>> >> >>
>>> >> >> Motivation:
>>> >> >>
>>> >> >> We can interchange Apache Arrow data via the C data interface
>>> >> >> in the same process. For example, we can pass Apache Arrow
>>> >> >> data read by Apache Arrow C++ (provider) to DuckDB
>>> >> >> (consumer) through the C data interface.
>>> >> >>
>>> >> >> A provider may know Apache Arrow data statistics. For
>>> >> >> example, a provider can know statistics when it reads Apache
>>> >> >> Parquet data, because Apache Parquet may provide statistics.
>>> >> >>
>>> >> >> But a consumer can't know statistics that are known by a
>>> >> >> producer, because there isn't a standard way to provide
>>> >> >> statistics through the C data interface. If a consumer can
>>> >> >> know statistics, it can process Apache Arrow data faster
>>> >> >> based on statistics.
>>> >> >>
>>> >> >> Proposal:
>>> >> >>
>>> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>>> >> >>
>>> >> >> How about providing statistics as metadata in ArrowSchema?
>>> >> >>
>>> >> >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>>> >> >>
>>> >> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>>> >> >>
>>> >> >>> The ARROW pattern is a reserved namespace for internal
>>> >> >>> Arrow use in the custom_metadata fields. For example,
>>> >> >>> ARROW:extension:name.
>>> >> >>
>>> >> >> So we can use "ARROW:statistics" for the metadata key.
>>> >> >>
>>> >> >> We can represent statistics as an ArrowArray like ADBC does.
>>> >> >>
>>> >> >> Here is an example ArrowSchema for a record batch
>>> >> >> that has "int32 column1" and "string column2":
>>> >> >>
>>> >> >> ArrowSchema {
>>> >> >>   .format = "+siu",
>>> >> >>   .metadata = {
>>> >> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>>> >> >>   },
>>> >> >>   .children = {
>>> >> >>     ArrowSchema {
>>> >> >>       .name = "column1",
>>> >> >>       .format = "i",
>>> >> >>       .metadata = {
>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>> >> >>       },
>>> >> >>     },
>>> >> >>     ArrowSchema {
>>> >> >>       .name = "column2",
>>> >> >>       .format = "u",
>>> >> >>       .metadata = {
>>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>>> >> >>       },
>>> >> >>     },
>>> >> >>   },
>>> >> >> }
>>> >> >>
>>> >> >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>>> >> >> => ArrowArray*' is a base-10 string of the address of the
>>> >> >> ArrowArray, because we can use only strings for metadata
>>> >> >> values. You can't release the statistics ArrowArray*. (Its
>>> >> >> release is a no-op function.) It follows
>>> >> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>>> >> >> semantics. (The base ArrowSchema owns the statistics
>>> >> >> ArrowArray*.)
>>> >> >>
>>> >> >>
>>> >> >> The ArrowArray* for statistics uses the following schema:
>>> >> >>
>>> >> >> | Field Name     | Field Type            | Comments |
>>> >> >> |----------------|-----------------------|----------|
>>> >> >> | key            | string not null       | (1)      |
>>> >> >> | value          | VALUE_SCHEMA not null |          |
>>> >> >> | is_approximate | bool not null         | (2)      |
>>> >> >>
>>> >> >> 1. We'll provide pre-defined keys such as "max", "min",
>>> >> >>    "byte_width" and "distinct_count", but users can also use
>>> >> >>    application-specific keys.
>>> >> >>
>>> >> >> 2. If true, then the value is approximate or best-effort.
>>> >> >>
>>> >> >> VALUE_SCHEMA is a dense union with members:
>>> >> >>
>>> >> >> | Field Name | Field Type                       | Comments |
>>> >> >> |------------|----------------------------------|----------|
>>> >> >> | int64      | int64                            |          |
>>> >> >> | uint64     | uint64                           |          |
>>> >> >> | float64    | float64                          |          |
>>> >> >> | value      | The same type as the ArrowSchema | (3)      |
>>> >> >> |            | that it belongs to.              |          |
>>> >> >>
>>> >> >> 3. If the ArrowSchema's type is string, this type is also string.
>>> >> >>
>>> >> >> TODO: Is "value" a good name? If we refer to it from the
>>> >> >> top-level statistics schema, we need to use
>>> >> >> "value.value". It's a bit strange...
>>> >> >>
>>> >> >>
>>> >> >> What do you think about this proposal? Could you share your
>>> >> >> comments?
>>> >> >>
>>> >> >>
>>> >> >> Thanks,
>>> >> >> --
>>> >> >> kou
>>>