kou commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2123728784
Updated version. Feedback is welcome. I'll share this idea on the `[email protected]` mailing list too.

It's based on https://github.com/apache/arrow/issues/38837#issuecomment-2108891730 .

If we have a record batch that has `int32 column1` and `string column2`, we have the following `ArrowSchema`. Note that `metadata` has `"ARROW:statistics" => ArrowArray*`. `ArrowArray*` is a base 10 string of the address of an `ArrowArray`, because metadata values can only be strings. You can't release the statistics `ArrowArray*`. (Its `release` is `NULL`.) It follows https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation semantics. (The base `ArrowSchema` owns the statistics `ArrowArray*`.)

```text
ArrowSchema {
  .format = "+s",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}
```

The `ArrowArray*` for statistics uses the following schema:

| Field Name     | Field Type              | Comments |
|----------------|-------------------------|----------|
| key            | string not null         | (1)      |
| value          | `VALUE_SCHEMA` not null |          |
| is_approximate | bool not null           | (2)      |

1. We'll provide pre-defined keys such as `max`, `min`, `byte_width` and `distinct_count`, but users can use application-specific keys too.
2. If true, the value is approximate or best-effort.

`VALUE_SCHEMA` is a dense union with the following members:

| Field Name | Field Type                                        | Comments |
|------------|---------------------------------------------------|----------|
| int64      | int64                                             |          |
| uint64     | uint64                                            |          |
| float64    | float64                                           |          |
| value      | The same type as the `ArrowSchema` it belongs to. | (3)      |

3. If the `ArrowSchema`'s type is `string`, this type is also `string`.

TODO: Is `value` a good name?
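
To illustrate the "base 10 string of the address" encoding described in the comment, here is a minimal Python sketch using `ctypes`. It is only an illustration of the round trip, not part of the proposal or of any Arrow library: the helper names `encode_statistics_address` and `decode_statistics_address` are hypothetical, and the struct layout is assumed to match `struct ArrowArray` from the Arrow C Data Interface.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """Assumed mirror of `struct ArrowArray` from the Arrow C Data Interface."""


# Fields are assigned after the class statement because the struct refers to itself.
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),  # callback pointer; NULL for statistics arrays
    ("private_data", ctypes.c_void_p),
]


def encode_statistics_address(array: ArrowArray) -> str:
    """Hypothetical helper: render the struct's address as a base 10 string."""
    return str(ctypes.addressof(array))


def decode_statistics_address(text: str) -> ArrowArray:
    """Hypothetical helper: parse a base 10 address string into a struct view."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a base 10 address: {text!r}")
    address = int(text)
    if address == 0:
        raise ValueError("address must not be NULL")
    # The view does not own the memory; the producer must keep the struct alive.
    return ArrowArray.from_address(address)


if __name__ == "__main__":
    # Producer side: `release` is left NULL, so a consumer must not release it.
    statistics = ArrowArray(length=3, null_count=0)
    metadata = {"ARROW:statistics": encode_statistics_address(statistics)}

    # Consumer side: recover the struct from the metadata string.
    view = decode_statistics_address(metadata["ARROW:statistics"])
    print(view.length, view.release)
```

Because the decoded view borrows memory owned by the base `ArrowSchema`, a consumer in this sketch only reads from it and never calls `release`, which matches the ownership rule stated in the comment.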
