On Thu, Jul 11, 2024 at 5:04 AM Sutou Kouhei <k...@clear-code.com> wrote:
> Hi,
>
> > for non-standard statistics from open-source products the key=0
> > combined with string label is the way to go
>
> Where do we store the string label?
>
> I think that we're considering the following schema:
>
> >> map<
> >>   // The column index or null if the statistics refer to the whole
> >>   // table or batch.
> >>   column: int32,
> >>   // Statistics key is int32.
> >>   // Different keys are assigned for exact value and
> >>   // approximate value.
> >>   map<int32, dense_union<...needed types based on stat kinds in the keys...>>
> >> >
>
> Are you considering the following schema for the key=0 case?
>
>   map<struct<int32, utf8>,
>       dense_union<...needed types based on stat kinds in the keys...>>

Yes. That's my suggestion.

And to leverage the fact that libraries handle unions gracefully, this
could be:

  map<X_union<int32, utf8>,
      dense_union<...needed types based on stat kinds in the keys...>>

where X is either sparse or dense.

A possible alternative is to use a custom struct instead of map and
reduce the levels of nesting:

  struct<int32, utf8, dense_union<...needed types based on the keys...>>

--
Felipe

> Thanks,
> --
> kou
>
> In <CAOC8YXYnePq=qfwvzhfqmoxgcubogbhb2gtmabmc7v-x2ap...@mail.gmail.com>
>   "Re: [DISCUSS] Statistics through the C data interface" on Mon, 1 Jul
>   2024 11:58:44 -0300,
>   Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>
> > Hi,
> >
> > You can promise that well-known int32 statistic keys won't ever be
> > higher than a certain value (2^18) [1], like TCP/IP ports (well-known
> > ports are in [0, 2^10)), but for non-standard statistics from
> > open-source products the key=0 combined with a string label is the way
> > to go; otherwise collisions would be inevitable and everyone would
> > hate us for having integer keys.
> >
> > This is not a very weird proposal on my part, because integer keys
> > representing labels are common in most low-level standardized C APIs
> > (e.g. Linux syscalls, ioctls, OpenGL, Vulkan...).
> > I expect higher-level APIs to map all these keys to strings, but with
> > integer keys we keep the "C Data Interface" low-level and portable,
> > as it should be.
> >
> > --
> > Felipe
> >
> > [1] 2^16 is too small. For instance, OpenGL constants can't be enums
> > because C limits enums to 2^16 and that is *not enough*.
> >
> > On Thu, Jun 20, 2024 at 7:43 AM Sutou Kouhei <k...@clear-code.com> wrote:
> >
> >> Hi,
> >>
> >> Here is an updated summary so far:
> >>
> >> ----
> >> Use cases:
> >>
> >> * Optimize query plan: e.g. JOIN for DuckDB
> >>
> >> Out of scope:
> >>
> >> * Transmitting statistics through anything other than the C data
> >>   interface
> >>   Examples:
> >>   * Transmitting statistics through an Apache Arrow IPC file
> >>   * Transmitting statistics through Apache Arrow Flight
> >> * Multi-column statistics
> >> * Constraints information
> >> * Indexes information
> >>
> >> Approach under discussion:
> >>
> >> Standardize an Apache Arrow schema for statistics and transmit
> >> statistics via a separate API call that uses the C data
> >> interface.
> >>
> >> This also works for per-batch statistics.
> >>
> >> Candidate schema:
> >>
> >>   map<
> >>     // The column index or null if the statistics refer to the whole
> >>     // table or batch.
> >>     column: int32,
> >>     // Statistics key is int32.
> >>     // Different keys are assigned for exact value and
> >>     // approximate value.
> >>     map<int32, dense_union<...needed types based on stat kinds in the keys...>>
> >>   >
> >>
> >> Discussions:
> >>
> >> 1. Can we use int32 for statistic keys?
> >>    Should we use utf8 (or dictionary<int32, utf8>) for
> >>    statistic keys?
> >> 2. How to support non-standard (vendor-specific)
> >>    statistic keys?
> >> ----
> >>
> >> Here is my idea:
> >>
> >> 1. We can use int32 for statistic keys.
> >> 2. We can reserve a specific range for non-standard
> >>    statistic keys. Prerequisites of this:
> >>    * There is no use case to merge some statistics for
> >>      the same data.
> >>    * We can't merge statistics for different data.
> >> If the prerequisites aren't satisfied:
> >>
> >> 1. We should use utf8 (or dictionary<int32, utf8>) for
> >>    statistic keys.
> >> 2. We can use a reserved prefix such as "ARROW:"/"arrow." for
> >>    standard statistic keys, or a prefix such as
> >>    "vendor1:"/"vendor1." for non-standard statistic keys.
> >>
> >> Here is Felipe's idea:
> >> https://lists.apache.org/thread/gr2nmlrwr7d5wkz3zgq6vy5q0ow8xof2
> >>
> >> 1. We can use int32 for statistic keys.
> >> 2. We can use the special statistic key + a string identifier
> >>    for non-standard statistic keys.
> >>
> >> What do you think about this?
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <20240606.182727.1004633558059795207....@clear-code.com>
> >>   "Re: [DISCUSS] Statistics through the C data interface" on Thu, 06
> >>   Jun 2024 18:27:27 +0900 (JST),
> >>   Sutou Kouhei <k...@clear-code.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > Thanks for sharing your comments. Here is a summary so far:
> >> >
> >> > ----
> >> >
> >> > Use cases:
> >> >
> >> > * Optimize query plan: e.g. JOIN for DuckDB
> >> >
> >> > Out of scope:
> >> >
> >> > * Transmitting statistics through anything other than the C data
> >> >   interface
> >> >   Examples:
> >> >   * Transmitting statistics through an Apache Arrow IPC file
> >> >   * Transmitting statistics through Apache Arrow Flight
> >> >
> >> > Candidate approaches:
> >> >
> >> > 1. Pass statistics (encoded as Apache Arrow data) via
> >> >    ArrowSchema metadata
> >> >    * This embeds the statistics address into the metadata
> >> >    * It avoids using the Apache Arrow IPC format with
> >> >      the C data interface
> >> > 2. Embed statistics (encoded as Apache Arrow data) into
> >> >    ArrowSchema metadata
> >> >    * This adds statistics to metadata in the Apache Arrow IPC
> >> >      format
> >> > 3. Embed statistics (encoded as JSON) into ArrowArray
> >> >    metadata
> >> > 4. Standardize an Apache Arrow schema for statistics and
> >> >    transmit statistics via a separate API call that uses the
> >> >    C data interface
> >> > 5.
> >> >    Use ADBC
> >> >
> >> > ----
> >> >
> >> > I think that 4. is the best approach among these candidates.
> >> >
> >> > 1. Embedding the statistics address is tricky.
> >> > 2. Consumers need to parse Apache Arrow IPC format data.
> >> >    (The C data interface consumers may not have that
> >> >    feature.)
> >> > 3. This will work, but 4. is more generic.
> >> > 5. ADBC is too large to use only for statistics.
> >> >
> >> > What do you think about this?
> >> >
> >> > If we select 4., we need to standardize an Apache Arrow schema
> >> > for statistics. How about the following schema?
> >> >
> >> > ----
> >> > Metadata:
> >> >
> >> > | Name                       | Value | Comments |
> >> > |----------------------------|-------|----------|
> >> > | ARROW::statistics::version | 1.0.0 | (1)      |
> >> >
> >> > (1) This follows semantic versioning.
> >> >
> >> > Fields:
> >> >
> >> > | Name           | Type                  | Comments |
> >> > |----------------|-----------------------|----------|
> >> > | column         | utf8                  | (2)      |
> >> > | key            | utf8 not null         | (3)      |
> >> > | value          | VALUE_SCHEMA not null |          |
> >> > | is_approximate | bool not null         | (4)      |
> >> >
> >> > (2) If null, then the statistic applies to the entire table.
> >> >     It's for "row_count".
> >> > (3) We'll provide pre-defined keys such as "max", "min",
> >> >     "byte_width" and "distinct_count", but users can also use
> >> >     application-specific keys.
> >> > (4) If true, then the value is approximate or best-effort.
> >> >
> >> > VALUE_SCHEMA is a dense union with members:
> >> >
> >> > | Name    | Type    |
> >> > |---------|---------|
> >> > | int64   | int64   |
> >> > | uint64  | uint64  |
> >> > | float64 | float64 |
> >> > | binary  | binary  |
> >> >
> >> > If a column is an int32 column, it uses int64 for
> >> > "max"/"min". We don't provide all types here. Users should
> >> > use a compatible type (int64 for an int32 column) instead.
> >> > ----
> >> >
> >> > Thanks,
> >> > --
> >> > kou
> >> >
> >> > In <20240522.113708.2023905028549001143....@clear-code.com>
> >> >   "[DISCUSS] Statistics through the C data interface" on Wed, 22
> >> >   May 2024 11:37:08 +0900 (JST),
> >> >   Sutou Kouhei <k...@clear-code.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> We're discussing how to provide statistics through the C
> >> >> data interface at:
> >> >> https://github.com/apache/arrow/issues/38837
> >> >>
> >> >> If you're interested in this feature, could you share your
> >> >> comments?
> >> >>
> >> >> Motivation:
> >> >>
> >> >> We can interchange Apache Arrow data via the C data interface
> >> >> within the same process. For example, we can pass Apache Arrow
> >> >> data read by Apache Arrow C++ (provider) to DuckDB
> >> >> (consumer) through the C data interface.
> >> >>
> >> >> A provider may know statistics for its Apache Arrow data. For
> >> >> example, a provider can know statistics when it reads Apache
> >> >> Parquet data, because Apache Parquet may provide statistics.
> >> >>
> >> >> But a consumer can't know statistics that are known by a
> >> >> producer, because there isn't a standard way to provide
> >> >> statistics through the C data interface. If a consumer can
> >> >> know statistics, it can process Apache Arrow data faster
> >> >> based on those statistics.
> >> >>
> >> >> Proposal:
> >> >>
> >> >> https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
> >> >>
> >> >> How about providing statistics as metadata in ArrowSchema?
> >> >>
> >> >> We reserve the "ARROW" namespace for internal Apache Arrow use:
> >> >>
> >> >> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> >> >>
> >> >>> The ARROW pattern is a reserved namespace for internal
> >> >>> Arrow use in the custom_metadata fields. For example,
> >> >>> ARROW:extension:name.
> >> >>
> >> >> So we can use "ARROW:statistics" for the metadata key.
> >> >> We can represent statistics as an ArrowArray like ADBC does.
> >> >>
> >> >> Here is an example ArrowSchema for a record batch
> >> >> that has "int32 column1" and "string column2":
> >> >>
> >> >> ArrowSchema {
> >> >>   .format = "+siu",
> >> >>   .metadata = {
> >> >>     "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
> >> >>   },
> >> >>   .children = {
> >> >>     ArrowSchema {
> >> >>       .name = "column1",
> >> >>       .format = "i",
> >> >>       .metadata = {
> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >> >>       },
> >> >>     },
> >> >>     ArrowSchema {
> >> >>       .name = "column2",
> >> >>       .format = "u",
> >> >>       .metadata = {
> >> >>         "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
> >> >>       },
> >> >>     },
> >> >>   },
> >> >> }
> >> >>
> >> >> The metadata value (the ArrowArray* part) of '"ARROW:statistics"
> >> >> => ArrowArray*' is a base-10 string of the address of the
> >> >> ArrowArray, because we can only use strings for metadata
> >> >> values. You can't release the statistics ArrowArray*. (Its
> >> >> release is a no-op function.) It follows
> >> >> https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
> >> >> semantics. (The base ArrowSchema owns the statistics
> >> >> ArrowArray*.)
> >> >>
> >> >> The ArrowArray* for statistics uses the following schema:
> >> >>
> >> >> | Field Name     | Field Type              | Comments |
> >> >> |----------------|-------------------------|----------|
> >> >> | key            | string not null         | (1)      |
> >> >> | value          | `VALUE_SCHEMA` not null |          |
> >> >> | is_approximate | bool not null           | (2)      |
> >> >>
> >> >> 1. We'll provide pre-defined keys such as "max", "min",
> >> >>    "byte_width" and "distinct_count", but users can also use
> >> >>    application-specific keys.
> >> >>
> >> >> 2. If true, then the value is approximate or best-effort.
> >> >> VALUE_SCHEMA is a dense union with members:
> >> >>
> >> >> | Field Name | Field Type                       | Comments |
> >> >> |------------|----------------------------------|----------|
> >> >> | int64      | int64                            |          |
> >> >> | uint64     | uint64                           |          |
> >> >> | float64    | float64                          |          |
> >> >> | value      | The same type as the ArrowSchema | (3)      |
> >> >> |            | that it belongs to.              |          |
> >> >>
> >> >> 3. If the ArrowSchema's type is string, this type is also string.
> >> >>
> >> >> TODO: Is "value" a good name? If we refer to it from the
> >> >> top-level statistics schema, we need to use
> >> >> "value.value". That's a bit strange...
> >> >>
> >> >> What do you think about this proposal? Could you share your
> >> >> comments?
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> kou
> >> >
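As a minimal sketch of the '"ARROW:statistics" => ArrowArray*' encoding described in the quoted proposal above (the address stored as a base-10 string, because metadata values can only be strings), here is the round-trip in Python; a ctypes integer stands in for a real ArrowArray struct, whose layout is not declared here:

```python
import ctypes

# Provider side: it owns the statistics ArrowArray and publishes its
# address. A ctypes object stands in for the ArrowArray struct.
stats_array = ctypes.c_int64(0)            # stand-in for struct ArrowArray
address = ctypes.addressof(stats_array)
metadata_value = str(address)              # base-10 string, as proposed

# Consumer side: parse the string back into an address. Per the
# proposal, the consumer must not release the ArrowArray it points to
# (the base ArrowSchema owns it; its release callback is a no-op).
parsed_address = int(metadata_value, 10)
assert parsed_address == address
```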