kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2074371230

I'm considering some approaches for this use case. This is not complete yet, but I'd like to share my idea so far. Feedback is appreciated.

ADBC uses the following schema to return statistics:

https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L1739-L1778

It's designed for returning the statistics of a database. We can simplify this schema because we only need to return the statistics of a record batch. For example:

| Field Name               | Field Type              | Comments |
|--------------------------|-------------------------|----------|
| column_name              | utf8                    | (1)      |
| statistic_key            | int16 not null          | (2)      |
| statistic_value          | `VALUE_SCHEMA` not null |          |
| statistic_is_approximate | bool not null           | (3)      |

1. If null, then the statistic applies to the entire table.
2. A dictionary-encoded statistic name (although we do not use the Arrow dictionary type). Values in [0, 1024) are reserved for ADBC. Other values are for implementation-specific statistics. For the definitions of predefined statistic types, see [adbc-table-statistics](https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L524-L570). To get driver-specific statistic names, use `AdbcConnectionGetStatisticNames()`.
3. If true, then the value is approximate or best-effort.

`VALUE_SCHEMA` is a dense union with members:

| Field Name | Field Type |
|------------|------------|
| int64      | int64      |
| uint64     | uint64     |
| float64    | float64    |
| binary     | binary     |

TODO: How should we represent the statistic key? Should we use the ADBC style (assigning an ID to each statistic key and using that ID)?

If we represent statistics as a record batch, we can pass them through the Arrow C data interface. This may be a reasonable approach.
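
To make the proposed row layout concrete, here is a minimal plain-Python sketch of what rows in such a statistics record batch might look like. It uses only the standard library and is a model of the schema, not an Arrow or ADBC API. The `StatisticRow` class, the `value_member` field, the `is_reserved_key` helper, and the `KEY_*` IDs are all hypothetical names and values invented for this illustration; the key IDs are not taken from `adbc.h`.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical key IDs, chosen only for this illustration (not copied from adbc.h).
KEY_ROW_COUNT = 0
KEY_NULL_COUNT = 1
KEY_MIN_VALUE = 2

# VALUE_SCHEMA member name -> Python type used here to stand in for it.
VALUE_MEMBERS = {"int64": int, "uint64": int, "float64": float, "binary": bytes}


def is_reserved_key(key: int) -> bool:
    """Keys in [0, 1024) are reserved for ADBC; others are implementation-specific."""
    return 0 <= key < 1024


@dataclass(frozen=True)
class StatisticRow:
    column_name: Optional[str]  # None models null: applies to the entire table
    statistic_key: int  # int16 not null
    value_member: str  # which member of the VALUE_SCHEMA union carries the value
    statistic_value: Union[int, float, bytes]  # not null
    statistic_is_approximate: bool  # not null

    def __post_init__(self) -> None:
        # Enforce the constraints that the proposed schema implies.
        if not -(2**15) <= self.statistic_key < 2**15:
            raise ValueError("statistic_key must fit in int16")
        if self.value_member not in VALUE_MEMBERS:
            raise ValueError(f"unknown union member: {self.value_member}")
        if not isinstance(self.statistic_value, VALUE_MEMBERS[self.value_member]):
            raise TypeError("statistic_value does not match the chosen union member")
        if self.value_member == "uint64" and self.statistic_value < 0:
            raise ValueError("uint64 values must be non-negative")


# Statistics for a hypothetical record batch with one column named "price".
rows = [
    StatisticRow(None, KEY_ROW_COUNT, "uint64", 3, False),
    StatisticRow("price", KEY_NULL_COUNT, "uint64", 1, False),
    StatisticRow("price", KEY_MIN_VALUE, "float64", 9.5, True),
]

for row in rows:
    scope = row.column_name if row.column_name is not None else "<entire table>"
    print(scope, row.statistic_key, row.value_member, row.statistic_value,
          row.statistic_is_approximate)
```

Each statistic is one row, so adding a new statistic kind only needs a new key ID rather than a new column. That is also why the open question about how to represent the key matters.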

If we use this approach, we need to do the following:

* Define a schema as a specification
* Add statistics-related APIs to Apache Arrow C++ and another implementation, because a specification change needs at least two reference implementations
  * https://arrow.apache.org/docs/format/Changing.html#at-least-two-reference-implementations
  * This is not a format change, but it's better to follow the rule
  * We can work on this for Apache Arrow C++ before we propose a specification because statistics will be useful for general purposes
  * Apache Arrow C++: Add support for importing statistics from Apache Parquet C++
  * ...

TODO: Consider a statistics-related API for Apache Arrow C++.