Hi,

>> map<struct<int32, utf8>,
>>     dense_union<...needed types based on stat kinds in the keys...>>
>>
>
> Yes. That's my suggestion. And to leverage the fact that libraries handle
> unions gracefully, this could be:
>
> map<X_union<int32, utf8>,
>     dense_union<...needed types based on stat kinds in the keys...>>
>
> X is either sparse or dense.
>
> A possible alternative is to use a custom struct instead of map and reduce
> the levels of nesting:
>
> struct<int32, utf8, dense_union<...needed types based on the keys...>>
Thanks for clarifying your suggestion. If we need utf8 for non-standard
statistics, I think that map<utf8, ...> or map<dictionary<int32, utf8>, ...>
is better, as Antoine said, because they are simpler than int32+utf8.

Thanks,
--
kou

In <caoc8yxy_e9w5sky572phktt-tdguc9v3xpxiom1xohp8rq7...@mail.gmail.com>
  "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul 2024 14:17:46 -0300,
  Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:

> On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote:
>
>> Hi,
>>
>> > for non-standard statistics from open-source products the key=0
>> > combined with string label is the way to go
>>
>> Where do we store the string label?
>>
>> I think that we're considering the following schema:
>>
>> map<
>>   // The column index or null if the statistics refer to the whole table or batch.
>>   column: int32,
>>   // Statistics key is int32.
>>   // Different keys are assigned for exact value and
>>   // approximate value.
>>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>> >
>>
>> Are you considering the following schema for the key=0 case?
>>
>> map<struct<int32, utf8>,
>>     dense_union<...needed types based on stat kinds in the keys...>>
>
> Yes. That's my suggestion. And to leverage the fact that libraries handle
> unions gracefully, this could be:
>
> map<X_union<int32, utf8>,
>     dense_union<...needed types based on stat kinds in the keys...>>
>
> X is either sparse or dense.
>
> A possible alternative is to use a custom struct instead of map and reduce
> the levels of nesting:
>
> struct<int32, utf8, dense_union<...needed types based on the keys...>>
>
> --
> Felipe
>
>> Thanks,
>> --
>> kou
>>
>> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com>
>>   "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul 2024 11:58:44 -0300,
>>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > You can promise that well-known int32 statistic keys won't ever be higher
>> > than a certain value (2^18) [1] like TCP/IP ports (well-known ports in
>> > [0, 2^10)), but for non-standard statistics from open-source products the
>> > key=0 combined with a string label is the way to go; otherwise collisions
>> > would be inevitable and everyone would hate us for having integer keys.
>> >
>> > This is not a very weird proposal on my part because integer keys
>> > representing labels are common in most low-level standardized C APIs (e.g.
>> > Linux syscalls, ioctls, OpenGL, Vulkan...). I expect higher-level APIs to
>> > map all these keys to strings, but with them we keep the "C Data Interface"
>> > low-level and portable as it should be.
>> >
>> > --
>> > Felipe
>> >
>> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums
>> > because C limits enum to 2^16 and that is *not enough*.
>> >
>> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Here is an updated summary so far:
>> >>
>> >> ----
>> >> Use cases:
>> >>
>> >> * Optimize query plan: e.g. JOIN for DuckDB
>> >>
>> >> Out of scope:
>> >>
>> >> * Transmit statistics through something other than the C data interface
>> >>   Examples:
>> >>   * Transmit statistics through an Apache Arrow IPC file
>> >>   * Transmit statistics through Apache Arrow Flight
>> >> * Multi-column statistics
>> >> * Constraints information
>> >> * Indexes information
>> >>
>> >> Discussing approach:
>> >>
>> >> Standardize an Apache Arrow schema for statistics and transmit
>> >> statistics via a separate API call that uses the C data
>> >> interface.
>> >>
>> >> This also works for per-batch statistics.
>> >>
>> >> Candidate schema:
>> >>
>> >> map<
>> >>   // The column index or null if the statistics refer to the whole table or batch.
>> >>   column: int32,
>> >>   // Statistics key is int32.
>> >>   // Different keys are assigned for exact value and
>> >>   // approximate value.
>> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
>> >> >
>> >>
>> >> Discussions:
>> >>
>> >> 1. Can we use int32 for statistic keys?
>> >>    Should we use utf8 (or dictionary<int32, utf8>) for
>> >>    statistic keys?
>> >> 2. How to support non-standard (vendor-specific)
>> >>    statistic keys?
>> >> ----
>> >>
>> >> Here is my idea:
>> >>
>> >> 1. We can use int32 for statistic keys.
>> >> 2. We can reserve a specific range for non-standard
>> >>    statistic keys. Prerequisites of this:
>> >>    * There is no use case that merges statistics for
>> >>      the same data.
>> >>    * We can't merge statistics for different data.
>> >>
>> >> If the prerequisites aren't satisfied:
>> >>
>> >> 1. We should use utf8 (or dictionary<int32, utf8>) for
>> >>    statistic keys.
>> >> 2. We can use a reserved prefix such as "ARROW:"/"arrow." for
>> >>    standard statistic keys and a prefix such as
>> >>    "vendor1:"/"vendor1." for non-standard statistic keys.
>> >>
>> >> Here is Felipe's idea:
>> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
>> >>
>> >> 1. We can use int32 for statistic keys.
>> >> 2. We can use the special statistic key + a string identifier
>> >>    for non-standard statistic keys.
>> >>
>> >> What do you think about this?
>> >>
>> >> Thanks,
>> >> --
>> >> kou
>> >>
>> >> In <20240606.182727.1004633558059795207....@clear-code.com>
>> >>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 Jun 2024 18:27:27 +0900 (JST),
>> >>   Sutou Kouhei <k...@clear-code.com> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > Thanks for sharing your comments. Here is a summary so far:
>> >> >
>> >> > ----
>> >> > Use cases:
>> >> >
>> >> > * Optimize query plan: e.g. JOIN for DuckDB
>> >> >
>> >> > Out of scope:
>> >> >
>> >> > * Transmit statistics through something other than the C data interface
>> >> >   Examples:
>> >> >   * Transmit statistics through an Apache Arrow IPC file
>> >> >   * Transmit statistics through Apache Arrow Flight
>> >> >
>> >> > Candidate approaches:
>> >> >
>> >> > 1. Pass statistics (encoded as Apache Arrow data) via
>> >> >    ArrowSchema metadata
>> >> >    * This embeds the statistics address into metadata
>> >> >    * It's for avoiding using the Apache Arrow IPC format with
>> >> >      the C data interface
>> >> > 2. Embed statistics (encoded as Apache Arrow data) into
>> >> >    ArrowSchema metadata
>> >> >    * This adds statistics to metadata in the Apache Arrow IPC
>> >> >      format
>> >> > 3. Embed statistics (encoded as JSON) into ArrowArray
>> >> >    metadata
>> >> > 4. Standardize an Apache Arrow schema for statistics and
>> >> >    transmit statistics via a separate API call that uses the
>> >> >    C data interface
>> >> > 5. Use ADBC
>> >> >
>> >> > ----
>> >> >
>> >> > I think that 4. is the best approach of these candidates.
>> >> >
>> >> > 1. Embedding the statistics address is tricky.
>> >> > 2. Consumers need to parse Apache Arrow IPC format data.
>> >> >    (The C data interface consumers may not have that
>> >> >    feature.)
>> >> > 3. This will work, but 4. is more generic.
>> >> > 5. ADBC is too large to use only for statistics.
>> >> >
>> >> > What do you think about this?
>> >> >
>> >> > If we select 4., we need to standardize an Apache Arrow schema
>> >> > for statistics. How about the following schema?
>> >> >
>> >> > ----
>> >> > Metadata:
>> >> >
>> >> > | Name                       | Value | Comments |
>> >> > |----------------------------|-------|----------|
>> >> > | ARROW::statistics::version | 1.0.0 | (1)      |
>> >> >
>> >> > (1) This follows semantic versioning.
>> >> >
>> >> > Fields:
>> >> >
>> >> > | Name           | Type                  | Comments |
>> >> > |----------------|-----------------------|----------|
>> >> > | column         | utf8                  | (2)      |
>> >> > | key            | utf8 not null         | (3)      |
>> >> > | value          | VALUE_SCHEMA not null |          |
>> >> > | is_approximate | bool not null         | (4)      |
>> >> >
>> >> > (2) If null, then the statistic applies to the entire table.
>> >> >     It's for "row_count".
>> >> > (3) We'll provide pre-defined keys such as "max", "min",
>> >> >     "byte_width" and "distinct_count", but users can also use
>> >> >     application-specific keys.
>> >> > (4) If true, then the value is approximate or best-effort.
>> >> >
>> >> > VALUE_SCHEMA is a dense union with members:
>> >> >
>> >> > | Name    | Type    |
>> >> > |---------|---------|
>> >> > | int64   | int64   |
>> >> > | uint64  | uint64  |
>> >> > | float64 | float64 |
>> >> > | binary  | binary  |
>> >> >
>> >> > If a column is an int32 column, it uses int64 for
>> >> > "max"/"min". We don't provide all types here. Users should
>> >> > use a compatible type (int64 for an int32 column) instead.
>> >> > ----
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > kou
>> >> >
>> >> > In <20240522.113708.2023905028549001143....@clear-code.com>
>> >> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22 May 2024 11:37:08 +0900 (JST),
>> >> >   Sutou Kouhei <k...@clear-code.com> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> We're discussing how to provide statistics through the C
>> >> >> data interface at:
>> >> >> https://github.com/apache/arrow/issues/38837
>> >> >>
>> >> >> If you're interested in this feature, could you share your
>> >> >> comments?
>> >> >>
>> >> >> Motivation:
>> >> >>
>> >> >> We can interchange Apache Arrow data via the C data interface
>> >> >> in the same process. For example, we can pass Apache Arrow
>> >> >> data read by Apache Arrow C++ (provider) to DuckDB
>> >> >> (consumer) through the C data interface.
>> >> >>
>> >> >> A provider may know Apache Arrow data statistics. For
>> >> >> example, a provider can know statistics when it reads Apache
>> >> >> Parquet data because Apache Parquet may provide statistics.
>> >> >>
>> >> >> But a consumer can't know the statistics that are known to the
>> >> >> provider, because there isn't a standard way to provide
>> >> >> statistics through the C data interface. If a consumer could
>> >> >> know the statistics, it could process Apache Arrow data faster
>> >> >> based on them.
>> >> >>
>> >> >> Proposal:
>> >> >>
>> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
>> >> >>
>> >> >> How about providing statistics as metadata in ArrowSchema?
>> >> >>
>> >> >> We reserve the "ARROW" namespace for internal Apache Arrow use:
>> >> >>
>> >> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
>> >> >>
>> >> >>> The ARROW pattern is a reserved namespace for internal
>> >> >>> Arrow use in the custom_metadata fields. For example,
>> >> >>> ARROW:extension:name.
>> >> >>
>> >> >> So we can use "ARROW:statistics" for the metadata key.
>> >> >>
>> >> >> We can represent statistics as an ArrowArray like ADBC does.
>> >> >>
>> >> >> Here is an example ArrowSchema for a record batch that has
>> >> >> "int32 column1" and "string column2":
>> >> >>
>> >> >> ArrowSchema {
>> >> >>   .format = "+s",
>> >> >>   .metadata = {
>> >> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
>> >> >>   },
>> >> >>   .children = {
>> >> >>     ArrowSchema {
>> >> >>       .name = "column1",
>> >> >>       .format = "i",
>> >> >>       .metadata = {
>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>> >> >>       },
>> >> >>     },
>> >> >>     ArrowSchema {
>> >> >>       .name = "column2",
>> >> >>       .format = "u",
>> >> >>       .metadata = {
>> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
>> >> >>       },
>> >> >>     },
>> >> >>   },
>> >> >> }
>> >> >>
>> >> >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
>> >> >> => ArrowArray*' is a base-10 string of the address of the
>> >> >> ArrowArray, because we can only use strings for metadata
>> >> >> values. You can't release the statistics ArrowArray*. (Its
>> >> >> release is a no-op function.) It follows the
>> >> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
>> >> >> semantics. (The base ArrowSchema owns the statistics
>> >> >> ArrowArray*.)
>> >> >>
>> >> >> The ArrowArray* for statistics uses the following schema:
>> >> >>
>> >> >> | Field Name     | Field Type            | Comments |
>> >> >> |----------------|-----------------------|----------|
>> >> >> | key            | string not null       | (1)      |
>> >> >> | value          | VALUE_SCHEMA not null |          |
>> >> >> | is_approximate | bool not null         | (2)      |
>> >> >>
>> >> >> 1. We'll provide pre-defined keys such as "max", "min",
>> >> >>    "byte_width" and "distinct_count", but users can also use
>> >> >>    application-specific keys.
>> >> >>
>> >> >> 2. If true, then the value is approximate or best-effort.
>> >> >>
>> >> >> VALUE_SCHEMA is a dense union with members:
>> >> >>
>> >> >> | Field Name | Field Type                       | Comments |
>> >> >> |------------|----------------------------------|----------|
>> >> >> | int64      | int64                            |          |
>> >> >> | uint64     | uint64                           |          |
>> >> >> | float64    | float64                          |          |
>> >> >> | value      | The same type as the ArrowSchema | (3)      |
>> >> >> |            | that it belongs to.              |          |
>> >> >>
>> >> >> 3. If the ArrowSchema's type is string, this type is also string.
>> >> >>
>> >> >> TODO: Is "value" a good name? If we refer to it from the
>> >> >> top-level statistics schema, we need to use
>> >> >> "value.value". It's a bit strange...
>> >> >>
>> >> >> What do you think about this proposal? Could you share your
>> >> >> comments?
>> >> >>
>> >> >> Thanks,
>> >> >> --
>> >> >> kou