kou commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2096990408
Thanks for sharing more information.

Here is my new idea:

It's based on "(2) Add statistics to `ArrowSchema::metadata`" in https://github.com/apache/arrow/issues/38837#issuecomment-2088101530 . That approach puts all statistics in the `metadata` of the top-level `ArrowSchema`. But I noticed that we don't need to do that: we can put the statistics for each child (column) in that child's `ArrowSchema::metadata`.

If we have a record batch that has `int32 column1` and `string column2`, we have the following `ArrowSchema`:

```text
ArrowSchema {
  .format = "+s",
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
    },
  },
}
```

We can put an `ArrowArray*` for statistics in each child `ArrowSchema::metadata` instead of putting all statistics in the top-level `ArrowSchema::metadata`:

```text
ArrowSchema {
  .format = "+s",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}
```

The `ArrowArray*` for statistics can use a simpler schema than the one in https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 :

| Field Name     | Field Type              | Comments |
|----------------|-------------------------|----------|
| key            | int16 not null          | (1)      |
| value          | `VALUE_SCHEMA` not null |          |
| is_approximate | bool not null           | (2)      |

1. A dictionary-encoded statistic name (although we do not use the Arrow dictionary type). Values in [0, 1024) are reserved for ADBC. Other values are for implementation-specific statistics.
   For the definitions of predefined statistic types, see [adbc-table-statistics](https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L524-L570). To get driver-specific statistic names, use `AdbcConnectionGetStatisticNames()`.
2. If true, then the value is approximate or best-effort.

`VALUE_SCHEMA` is a dense union with members:

| Field Name | Field Type |
|------------|------------|
| int64      | int64      |
| uint64     | uint64     |
| float64    | float64    |
| binary     | binary     |
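
To make the `key` / `value` / `is_approximate` row layout concrete, here is a minimal C sketch that models one statistics row as a plain struct with a tagged union. It is only an illustration: `StatValueKind`, `StatRow`, `stat_row_int64()` and `stat_row_is_reserved_key()` are hypothetical names that exist in neither Arrow nor ADBC, and a real producer would export these rows as an Arrow struct array whose `value` child is a dense union, not as C structs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag naming the active member of VALUE_SCHEMA. */
typedef enum {
  STAT_VALUE_INT64 = 0,
  STAT_VALUE_UINT64 = 1,
  STAT_VALUE_FLOAT64 = 2,
  STAT_VALUE_BINARY = 3
} StatValueKind;

/* Hypothetical in-memory model of one statistics row. */
typedef struct {
  int16_t key;        /* statistic name code; [0, 1024) is reserved for ADBC */
  StatValueKind kind; /* which member of the union below is valid */
  union {
    int64_t i64;
    uint64_t u64;
    double f64;
    struct {
      const uint8_t* data;
      size_t size;
    } bin;
  } value;
  bool is_approximate; /* true when the value is approximate or best-effort */
} StatRow;

/* Returns true when the key falls in the range reserved for ADBC. */
static bool stat_row_is_reserved_key(int16_t key) {
  return key >= 0 && key < 1024;
}

/* Builds a row that carries an int64 value. */
static StatRow stat_row_int64(int16_t key, int64_t v, bool is_approximate) {
  StatRow row;
  row.key = key;
  row.kind = STAT_VALUE_INT64;
  row.value.i64 = v;
  row.is_approximate = is_approximate;
  return row;
}
```

For example, `stat_row_int64(1, 42, true)` would describe an approximate int64 statistic stored under an arbitrary key in the reserved range. A consumer would check `kind` before reading `value`, which mirrors how a dense union's type ids select the active child.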

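
The per-column placement of `"ARROW:statistics"` relies on the `metadata` field of `ArrowSchema`, which the Arrow C data interface defines as a binary blob: an int32 pair count followed, for each pair, by an int32 key length, the key bytes, an int32 value length, and the value bytes, all in native byte order. The sketch below builds such a blob with a single `ARROW:statistics` entry. How the `ArrowArray*` would be stored in the value is not settled in this comment; copying the raw pointer bytes, as done here, is purely an assumption for illustration. `encode_statistics_metadata()` and `decode_statistics_pointer()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct ArrowArray; /* opaque here; defined by the Arrow C data interface */

#define STATISTICS_KEY "ARROW:statistics"

/* Encodes {"ARROW:statistics" => raw pointer bytes} in the C data interface
 * metadata layout. Returns a malloc()ed buffer the caller must free(),
 * or NULL on allocation failure. */
static char* encode_statistics_metadata(struct ArrowArray* statistics) {
  const int32_t n_pairs = 1;
  const int32_t key_size = (int32_t)strlen(STATISTICS_KEY);
  const int32_t value_size = (int32_t)sizeof(statistics);
  const size_t total =
      sizeof(int32_t) * 3 + (size_t)key_size + (size_t)value_size;
  char* buffer = (char*)malloc(total);
  if (buffer == NULL) return NULL;

  char* p = buffer;
  memcpy(p, &n_pairs, sizeof(n_pairs));
  p += sizeof(n_pairs);
  memcpy(p, &key_size, sizeof(key_size));
  p += sizeof(key_size);
  memcpy(p, STATISTICS_KEY, (size_t)key_size);
  p += key_size;
  memcpy(p, &value_size, sizeof(value_size));
  p += sizeof(value_size);
  /* ASSUMPTION: the value is the raw bytes of the pointer itself. */
  memcpy(p, &statistics, sizeof(statistics));
  return buffer;
}

/* Reads the pointer back from a blob produced by the encoder above.
 * Returns NULL when the first pair is not a well-formed statistics entry. */
static struct ArrowArray* decode_statistics_pointer(const char* metadata) {
  int32_t n_pairs, key_size, value_size;
  struct ArrowArray* statistics = NULL;
  const char* p = metadata;

  memcpy(&n_pairs, p, sizeof(n_pairs));
  p += sizeof(n_pairs);
  if (n_pairs < 1) return NULL;

  memcpy(&key_size, p, sizeof(key_size));
  p += sizeof(key_size);
  if (key_size != (int32_t)strlen(STATISTICS_KEY) ||
      memcmp(p, STATISTICS_KEY, (size_t)key_size) != 0) {
    return NULL;
  }
  p += key_size;

  memcpy(&value_size, p, sizeof(value_size));
  p += sizeof(value_size);
  if (value_size != (int32_t)sizeof(statistics)) return NULL;

  memcpy(&statistics, p, sizeof(statistics));
  return statistics;
}
```

A producer following this sketch would assign the returned buffer to the `metadata` member of the top-level schema (table-level statistics) or of a child schema (column-level statistics), and free it from that schema's `release` callback. A raw pointer in metadata is only meaningful within the producing process, which is one reason the encoding is left as an assumption here.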