kou commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2108891730
Here is the latest idea. Feedback is welcome.

It's based on https://github.com/apache/arrow/issues/38837#issuecomment-2096990408 and https://github.com/apache/arrow/issues/38837#issuecomment-2097062873 .

If we have a record batch that has `int32 column1` and `string column2`, we have the following `ArrowSchema`. Note that `metadata` has `"ARROW:statistics" => ArrowArray*`.

```text
ArrowSchema {
  .format = "+s",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}
```

The `ArrowArray*` for statistics uses the following schema:

| Field Name     | Field Type              | Comments |
|----------------|-------------------------|----------|
| key            | int16 not null          | (1)      |
| value          | `VALUE_SCHEMA` not null |          |
| is_approximate | bool not null           | (2)      |

1. A dictionary-encoded statistic name (although we do not use the Arrow dictionary type). Values in [0, 1024) are reserved for Apache Arrow. The values should be aligned with ADBC. Other values are for implementation-specific statistics. For the definitions of predefined statistic types, see [adbc-table-statistics](https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L524-L570). TODO: Should we provide a feature to get driver-specific statistic names? ADBC has `AdbcConnectionGetStatisticNames()`. Or should we use `string` instead of `int16`?
2. If true, then the value is approximate or best-effort.
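
As a rough illustration of the row layout above, the following Python sketch models one statistics record as `(key, value, is_approximate)` and checks both the `int16` range and the reserved `[0, 1024)` range. The `Statistic` class, `is_reserved_key()`, and the two key constants are hypothetical names for this sketch only. The key numbers are meant to mirror ADBC's predefined statistics and should be checked against the linked `adbc.h`. A plain Python value stands in for `VALUE_SCHEMA`, and plain lists stand in for the `ArrowArray*` stored under `"ARROW:statistics"`.

```python
from dataclasses import dataclass
from typing import Union

# Keys in [0, 1024) are reserved for Apache Arrow in the proposal.
RESERVED_KEY_LIMIT = 1024
INT16_MIN, INT16_MAX = -(2**15), 2**15 - 1

# Illustrative key numbers, intended to mirror ADBC's predefined
# statistics. Assumption: verify them against the linked adbc.h.
KEY_DISTINCT_COUNT = 1
KEY_ROW_COUNT = 6


@dataclass(frozen=True)
class Statistic:
    """One row of the proposed statistics array (hypothetical model)."""

    key: int                        # int16 not null
    value: Union[int, float, str]   # stand-in for VALUE_SCHEMA, not null
    is_approximate: bool            # bool not null

    def __post_init__(self) -> None:
        if not INT16_MIN <= self.key <= INT16_MAX:
            raise ValueError(f"key {self.key} does not fit in int16")
        if self.value is None:
            raise ValueError("value must not be null")


def is_reserved_key(key: int) -> bool:
    """Return True if the key falls in the range reserved for Apache Arrow."""
    return 0 <= key < RESERVED_KEY_LIMIT


# Table-level statistics would sit in the top-level schema's metadata,
# column-level statistics in each child schema's metadata.
table_statistics = [Statistic(KEY_ROW_COUNT, 1000, is_approximate=False)]
column1_statistics = [Statistic(KEY_DISTINCT_COUNT, 42, is_approximate=True)]
```

Keys at or above 1024 (and, since `int16` is signed, negative keys) fall outside the reserved range, so in this sketch `is_reserved_key()` returns `False` for them.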
`VALUE_SCHEMA` is a dense union with members:

| Field Name | Field Type                                         | Comments |
|------------|----------------------------------------------------|----------|
| int64      | int64                                              |          |
| uint64     | uint64                                             |          |
| float64    | float64                                            |          |
| value      | The same type as the `ArrowSchema` it belongs to.  | (3)      |

3. If the `ArrowSchema`'s type is `string`, this type is also `string`. TODO: Is `value` a good name?

TODO: Should we embed `VALUE_SCHEMA` into the statistics schema, something like the following?

| Field Name     | Field Type                                         | Comments |
|----------------|----------------------------------------------------|----------|
| key            | int16 not null                                     | (1)      |
| value_int64    | int64                                              | (4)      |
| value_uint64   | uint64                                             | (4)      |
| value_float64  | float64                                            | (4)      |
| value          | The same type as the `ArrowSchema` it belongs to.  | (3) (4)  |
| is_approximate | bool not null                                      | (2)      |

4. Exactly one of these fields is not null.
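
To make the two candidate layouts easier to compare, here is a small Python sketch that converts a union-style value (a member name plus a value) into the flattened row and checks rule (4). The helper names `to_flat_row()` and `is_valid_flat_row()` are hypothetical, dictionaries stand in for Arrow rows, and the key number used in the example is only illustrative.

```python
from typing import Any, Dict

# Dense-union member name -> column name in the flattened alternative.
UNION_TO_COLUMN = {
    "int64": "value_int64",
    "uint64": "value_uint64",
    "float64": "value_float64",
    "value": "value",
}


def to_flat_row(key: int, member: str, value: Any,
                is_approximate: bool) -> Dict[str, Any]:
    """Build a flattened statistics row from a union-style (member, value)."""
    if member not in UNION_TO_COLUMN:
        raise ValueError(f"unknown union member: {member!r}")
    if value is None:
        raise ValueError("the selected member must not be null")
    row: Dict[str, Any] = {"key": key, "is_approximate": is_approximate}
    # Every value column starts as null, then the chosen one is filled in.
    for column in UNION_TO_COLUMN.values():
        row[column] = None
    row[UNION_TO_COLUMN[member]] = value
    return row


def is_valid_flat_row(row: Dict[str, Any]) -> bool:
    """Rule (4): exactly one of the value columns is not null."""
    non_null = sum(row[c] is not None for c in UNION_TO_COLUMN.values())
    return non_null == 1


# Example: a string-typed statistic for `string column2`
# (key 3 is an illustrative number, not taken from the proposal).
row = to_flat_row(3, "value", "zebra", is_approximate=False)
```

In the dense-union layout the "exactly one member" property is structural, while in the flattened layout it is a convention that consumers would have to validate, as `is_valid_flat_row()` does here.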