Hi, I'm implementing C++ producer for statistics array: https://github.com/apache/arrow/pull/44252
Our discussed statistics schema is compact: https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R77-R145 But it may be a bit complex to build. What do you think about this? Thanks, -- kou In <20240805.183331.1066091419162501890....@clear-code.com> "Re: [DISCUSS] Statistics through the C data interface" on Mon, 05 Aug 2024 18:33:31 +0900 (JST), Sutou Kouhei <k...@clear-code.com> wrote: > Hi, > > I've opened a PR for documentation: > https://github.com/apache/arrow/pull/43553 > > The "Example" section isn't written yet but suggestions are > very welcome. > > Thanks, > -- > kou > > In <20240725.143518.421507820763165665....@clear-code.com> > "Re: [DISCUSS] Statistics through the C data interface" on Thu, 25 Jul 2024 > 14:35:18 +0900 (JST), > Sutou Kouhei <k...@clear-code.com> wrote: > >> Hi, >> >> If nobody objects using utf8 or dictionary<int32, utf8> for >> statistics key, let's use dictionary<int32, utf8>. Because >> dictionary<int32, utf8> will be more effective than utf8 >> when there are many columns. >> >> I'll start writing a documentation for this and implementing >> this for C++ next week. I'll share a PR for them when I >> complete them. We can start a vote for this after we review >> the PR. >> >> >> Thanks, >> -- >> kou >> >> In <20240712.151536.312169170508271330....@clear-code.com> >> "Re: [DISCUSS] Statistics through the C data interface" on Fri, 12 Jul >> 2024 15:15:36 +0900 (JST), >> Sutou Kouhei <k...@clear-code.com> wrote: >> >>> Hi, >>> >>>>> map<struct<int32, utf8>, >>>>> dense_union<...needed types based on stat kinds in the keys...>> >>>>> >>>> >>>> Yes. That's my suggestion. And to leverage the fact that libraries handles >>>> unions gracefully, this could be: >>>> >>>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds >>>> in the keys...>> >>>> >>>> X is either sparse or dense. >>>> >>>> A possible alternative is to use a custom struct instead of map and reduce >>>> the levels of nesting: >>>> >>>> struct<int32, utf8, dense_union<...needed types base on the keys...>> >>> >>> Thanks for clarifying your suggestion. >>> >>> If we need utf8 for non-standard statistics, I think that >>> map<utf8, ...> or map<dictionary<int32, utf8>, ...> is >>> better as Antoine said. Because they are simpler than >>> int32+utf8. >>> >>> >>> Thanks, >>> -- >>> kou >>> >>> In <caoc8yxy_e9w5sky572phktt-tdguc9v3xpxiom1xohp8rq7...@mail.gmail.com> >>> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 11 Jul >>> 2024 14:17:46 -0300, >>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote: >>> >>>> On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> > for non-standard statistics from open-source products the >>>>> key=0 >>>>> > combined with string label is the way to go >>>>> >>>>> Where do we store the string label? >>>>> >>>>> I think that we're considering the following schema: >>>>> >>>>> >> map< >>>>> >> // The column index or null if the statistics refer to whole table or >>>>> batch. >>>>> >> column: int32, >>>>> >> // Statistics key is int32. >>>>> >> // Different keys are assigned for exact value and >>>>> >> // approximate value. >>>>> >> map<int32, dense_union<...needed types based on stat kinds in the >>>>> keys...>> >>>>> >> > >>>>> >>>>> Are you considering the following schema for key=0 case? >>>>> >>>>> map<struct<int32, utf8>, >>>>> dense_union<...needed types based on stat kinds in the keys...>> >>>>> >>>> >>>> Yes. That's my suggestion. And to leverage the fact that libraries handles >>>> unions gracefully, this could be: >>>> >>>> map<X_union<int32, utf8>, dense_union<...needed types based on stat kinds >>>> in the keys...>> >>>> >>>> X is either sparse or dense. >>>> >>>> A possible alternative is to use a custom struct instead of map and reduce >>>> the levels of nesting: >>>> >>>> struct<int32, utf8, dense_union<...needed types base on the keys...>> >>>> >>>> -- >>>> Felipe >>>> >>>> >>>>> >>>>> Thanks, >>>>> -- >>>>> kou >>>>> >>>>> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com> >>>>> "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul >>>>> 2024 11:58:44 -0300, >>>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote: >>>>> >>>>> > Hi, >>>>> > >>>>> > You can promise that well-known int32 statistic keys won't ever be >>>>> > higher >>>>> > than a certain value (2^18) [1] like TCP IP ports (well-known ports in >>>>> [0, >>>>> > 2^10)) but for non-standard statistics from open-source products the >>>>> key=0 >>>>> > combined with string label is the way to go, otherwise collisions would >>>>> be >>>>> > inevitable and everyone would hate us for having integer keys. >>>>> > >>>>> > This is not a very weird proposal from my part because integer keys >>>>> > representing labels are common in most low-level standardized C APIs >>>>> (e.g. >>>>> > linux syscalls, ioctls, OpenGL, Vulcan...). I expect higher level APIs >>>>> > to >>>>> > map all these keys to strings, but with them we keep the "C Data >>>>> Interface" >>>>> > low-level and portable as it should be. >>>>> > >>>>> > -- >>>>> > Felipe >>>>> > >>>>> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums >>>>> > because C limits enum to 2^16 and that is *not enough*. >>>>> > >>>>> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> >>>>> > wrote: >>>>> > >>>>> >> Hi, >>>>> >> >>>>> >> Here is an updated summary so far: >>>>> >> >>>>> >> ---- >>>>> >> Use cases: >>>>> >> >>>>> >> * Optimize query plan: e.g. JOIN for DuckDB >>>>> >> >>>>> >> Out of scope: >>>>> >> >>>>> >> * Transmit statistics through not the C data interface >>>>> >> Examples: >>>>> >> * Transmit statistics through Apache Arrow IPC file >>>>> >> * Transmit statistics through Apache Arrow Flight >>>>> >> * Multi-column statistics >>>>> >> * Constraints information >>>>> >> * Indexes information >>>>> >> >>>>> >> Discussing approach: >>>>> >> >>>>> >> Standardize Apache Arrow schema for statistics and transmit >>>>> >> statistics via separated API call that uses the C data >>>>> >> interface. >>>>> >> >>>>> >> This also works for per-batch statistics. >>>>> >> >>>>> >> Candidate schema: >>>>> >> >>>>> >> map< >>>>> >> // The column index or null if the statistics refer to whole table or >>>>> >> batch. >>>>> >> column: int32, >>>>> >> // Statistics key is int32. >>>>> >> // Different keys are assigned for exact value and >>>>> >> // approximate value. >>>>> >> map<int32, dense_union<...needed types based on stat kinds in the >>>>> >> keys...>> >>>>> >> > >>>>> >> >>>>> >> Discussions: >>>>> >> >>>>> >> 1. Can we use int32 for statistic keys? >>>>> >> Should we use utf8 (or dictionary<int32, utf8>) for >>>>> >> statistic keys? >>>>> >> 2. Hot to support non-standard (vendor-specific) >>>>> >> statistic keys? >>>>> >> ---- >>>>> >> >>>>> >> Here is my idea: >>>>> >> >>>>> >> 1. We can use int32 for statistic keys. >>>>> >> 2. We can reserve a specific range for non-standard >>>>> >> statistic keys. Prerequisites of this: >>>>> >> * There is no use case to merge some statistics for >>>>> >> the same data. >>>>> >> * We can't merge statistics for different data. >>>>> >> >>>>> >> If the prerequisites aren't satisfied: >>>>> >> >>>>> >> 1. We should use utf8 (or dictionary<int32, utf8>) for >>>>> >> statistic keys? >>>>> >> 2. We can use reserved prefix such as "ARROW:"/"arrow." for >>>>> >> standard statistic keys or use prefix such as >>>>> >> "vendor1:"/"vendor1." for non-standard statistic keys. >>>>> >> >>>>> >> Here is Felipe's idea: >>>>> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2 >>>>> >> >>>>> >> 1. We can use int32 for statistic keys. >>>>> >> 2. We can use the special statistic key + a string identifier >>>>> >> for non-standard statistic keys. >>>>> >> >>>>> >> >>>>> >> What do you think about this? >>>>> >> >>>>> >> >>>>> >> Thanks, >>>>> >> -- >>>>> >> kou >>>>> >> >>>>> >> In <20240606.182727.1004633558059795207....@clear-code.com> >>>>> >> "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06 >>>>> >> Jun >>>>> >> 2024 18:27:27 +0900 (JST), >>>>> >> Sutou Kouhei <k...@clear-code.com> wrote: >>>>> >> >>>>> >> > Hi, >>>>> >> > >>>>> >> > Thanks for sharing your comments. Here is a summary so far: >>>>> >> > >>>>> >> > ---- >>>>> >> > >>>>> >> > Use cases: >>>>> >> > >>>>> >> > * Optimize query plan: e.g. JOIN for DuckDB >>>>> >> > >>>>> >> > Out of scope: >>>>> >> > >>>>> >> > * Transmit statistics through not the C data interface >>>>> >> > Examples: >>>>> >> > * Transmit statistics through Apache Arrow IPC file >>>>> >> > * Transmit statistics through Apache Arrow Flight >>>>> >> > >>>>> >> > Candidate approaches: >>>>> >> > >>>>> >> > 1. Pass statistics (encoded as an Apache Arrow data) via >>>>> >> > ArrowSchema metadata >>>>> >> > * This embeds statistics address into metadata >>>>> >> > * It's for avoiding using Apache Arrow IPC format with >>>>> >> > the C data interface >>>>> >> > 2. Embed statistics (encoded as an Apache Arrow data) into >>>>> >> > ArrowSchema metadata >>>>> >> > * This adds statistics to metadata in Apache Arrow IPC >>>>> >> > format >>>>> >> > 3. Embed statistics (encoded as JSON) into ArrowArray >>>>> >> > metadata >>>>> >> > 4. Standardize Apache Arrow schema for statistics and >>>>> >> > transmit statistics via separated API call that uses the >>>>> >> > C data interface >>>>> >> > 5. Use ADBC >>>>> >> > >>>>> >> > ---- >>>>> >> > >>>>> >> > I think that 4. is the best approach in these candidates. >>>>> >> > >>>>> >> > 1. Embedding statistics address is tricky. >>>>> >> > 2. Consumers need to parse Apache Arrow IPC format data. >>>>> >> > (The C data interface consumers may not have the >>>>> >> > feature.) >>>>> >> > 3. This will work but 4. is more generic. >>>>> >> > 5. ADBC is too large to use only for statistics. >>>>> >> > >>>>> >> > What do you think about this? >>>>> >> > >>>>> >> > >>>>> >> > If we select 4., we need to standardize Apache Arrow schema >>>>> >> > for statistics. How about the following schema? >>>>> >> > >>>>> >> > ---- >>>>> >> > Metadata: >>>>> >> > >>>>> >> > | Name | Value | Comments | >>>>> >> > |----------------------------|-------|--------- | >>>>> >> > | ARROW::statistics::version | 1.0.0 | (1) | >>>>> >> > >>>>> >> > (1) This follows semantic versioning. >>>>> >> > >>>>> >> > Fields: >>>>> >> > >>>>> >> > | Name | Type | Comments | >>>>> >> > |----------------|-----------------------| -------- | >>>>> >> > | column | utf8 | (2) | >>>>> >> > | key | utf8 not null | (3) | >>>>> >> > | value | VALUE_SCHEMA not null | | >>>>> >> > | is_approximate | bool not null | (4) | >>>>> >> > >>>>> >> > (2) If null, then the statistic applies to the entire table. >>>>> >> > It's for "row_count". >>>>> >> > (3) We'll provide pre-defined keys such as "max", "min", >>>>> >> > "byte_width" and "distinct_count" but users can also use >>>>> >> > application specific keys. >>>>> >> > (4) If true, then the value is approximate or best-effort. >>>>> >> > >>>>> >> > VALUE_SCHEMA is a dense union with members: >>>>> >> > >>>>> >> > | Name | Type | >>>>> >> > |---------|---------| >>>>> >> > | int64 | int64 | >>>>> >> > | uint64 | uint64 | >>>>> >> > | float64 | float64 | >>>>> >> > | binary | binary | >>>>> >> > >>>>> >> > If a column is an int32 column, it uses int64 for >>>>> >> > "max"/"min". We don't provide all types here. Users should >>>>> >> > use a compatible type (int64 for a int32 column) instead. >>>>> >> > ---- >>>>> >> > >>>>> >> > >>>>> >> > Thanks, >>>>> >> > -- >>>>> >> > kou >>>>> >> > >>>>> >> > >>>>> >> > In <20240522.113708.2023905028549001143....@clear-code.com> >>>>> >> > "[DISCUSS] Statistics through the C data interface" on Wed, 22 May >>>>> >> 2024 11:37:08 +0900 (JST), >>>>> >> > Sutou Kouhei <k...@clear-code.com> wrote: >>>>> >> > >>>>> >> >> Hi, >>>>> >> >> >>>>> >> >> We're discussing how to provide statistics through the C >>>>> >> >> data interface at: >>>>> >> >> https://github.com/apache/arrow/issues/38837 >>>>> >> >> >>>>> >> >> If you're interested in this feature, could you share your >>>>> >> >> comments? >>>>> >> >> >>>>> >> >> >>>>> >> >> Motivation: >>>>> >> >> >>>>> >> >> We can interchange Apache Arrow data by the C data interface >>>>> >> >> in the same process. For example, we can pass Apache Arrow >>>>> >> >> data read by Apache Arrow C++ (provider) to DuckDB >>>>> >> >> (consumer) through the C data interface. >>>>> >> >> >>>>> >> >> A provider may know Apache Arrow data statistics. For >>>>> >> >> example, a provider can know statistics when it reads Apache >>>>> >> >> Parquet data because Apache Parquet may provide statistics. >>>>> >> >> >>>>> >> >> But a consumer can't know statistics that are known by a >>>>> >> >> producer. Because there isn't a standard way to provide >>>>> >> >> statistics through the C data interface. If a consumer can >>>>> >> >> know statistics, it can process Apache Arrow data faster >>>>> >> >> based on statistics. >>>>> >> >> >>>>> >> >> >>>>> >> >> Proposal: >>>>> >> >> >>>>> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784 >>>>> >> >> >>>>> >> >> How about providing statistics as a metadata in ArrowSchema? >>>>> >> >> >>>>> >> >> We reserve "ARROW" namespace for internal Apache Arrow use: >>>>> >> >> >>>>> >> >> >>>>> >> >>>>> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata >>>>> >> >> >>>>> >> >>> The ARROW pattern is a reserved namespace for internal >>>>> >> >>> Arrow use in the custom_metadata fields. For example, >>>>> >> >>> ARROW:extension:name. >>>>> >> >> >>>>> >> >> So we can use "ARROW:statistics" for the metadata key. >>>>> >> >> >>>>> >> >> We can represent statistics as a ArrowArray like ADBC does. >>>>> >> >> >>>>> >> >> Here is an example ArrowSchema that is for a record batch >>>>> >> >> that has "int32 column1" and "string column2": >>>>> >> >> >>>>> >> >> ArrowSchema { >>>>> >> >> .format = "+siu", >>>>> >> >> .metadata = { >>>>> >> >> "ARROW:statistics" => ArrowArray*, /* table-level statistics >>>>> >> >> such >>>>> >> as row count */ >>>>> >> >> }, >>>>> >> >> .children = { >>>>> >> >> ArrowSchema { >>>>> >> >> .name = "column1", >>>>> >> >> .format = "i", >>>>> >> >> .metadata = { >>>>> >> >> "ARROW:statistics" => ArrowArray*, /* column-level >>>>> >> >> statistics >>>>> >> such as count distinct */ >>>>> >> >> }, >>>>> >> >> }, >>>>> >> >> ArrowSchema { >>>>> >> >> .name = "column2", >>>>> >> >> .format = "u", >>>>> >> >> .metadata = { >>>>> >> >> "ARROW:statistics" => ArrowArray*, /* column-level >>>>> >> >> statistics >>>>> >> such as count distinct */ >>>>> >> >> }, >>>>> >> >> }, >>>>> >> >> }, >>>>> >> >> } >>>>> >> >> >>>>> >> >> The metadata value (ArrowArray* part) of '"ARROW:statistics" >>>>> >> >> => ArrowArray*' is a base 10 string of the address of the >>>>> >> >> ArrowArray. Because we can use only string for metadata >>>>> >> >> value. You can't release the statistics ArrowArray*. (Its >>>>> >> >> release is a no-op function.) It follows >>>>> >> >> >>>>> >> >>>>> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation >>>>> >> >> semantics. (The base ArrowSchema owns statistics >>>>> >> >> ArrowArray*.) >>>>> >> >> >>>>> >> >> >>>>> >> >> ArrowArray* for statistics use the following schema: >>>>> >> >> >>>>> >> >> | Field Name | Field Type | Comments | >>>>> >> >> |----------------|----------------------------------| -------- | >>>>> >> >> | key | string not null | (1) | >>>>> >> >> | value | `VALUE_SCHEMA` not null | | >>>>> >> >> | is_approximate | bool not null | (2) | >>>>> >> >> >>>>> >> >> 1. We'll provide pre-defined keys such as "max", "min", >>>>> >> >> "byte_width" and "distinct_count" but users can also use >>>>> >> >> application specific keys. >>>>> >> >> >>>>> >> >> 2. If true, then the value is approximate or best-effort. >>>>> >> >> >>>>> >> >> VALUE_SCHEMA is a dense union with members: >>>>> >> >> >>>>> >> >> | Field Name | Field Type | Comments | >>>>> >> >> |------------|----------------------------------| -------- | >>>>> >> >> | int64 | int64 | | >>>>> >> >> | uint64 | uint64 | | >>>>> >> >> | float64 | float64 | | >>>>> >> >> | value | The same type of the ArrowSchema | (3) | >>>>> >> >> | | that is belonged to. | | >>>>> >> >> >>>>> >> >> 3. If the ArrowSchema's type is string, this type is also string. >>>>> >> >> >>>>> >> >> TODO: Is "value" good name? If we refer it from the >>>>> >> >> top-level statistics schema, we need to use >>>>> >> >> "value.value". It's a bit strange... >>>>> >> >> >>>>> >> >> >>>>> >> >> What do you think about this proposal? Could you share your >>>>> >> >> comments? >>>>> >> >> >>>>> >> >> >>>>> >> >> Thanks, >>>>> >> >> -- >>>>> >> >> kou >>>>> >> >>>>>