lidavidm commented on issue #685: URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1593400443
So the proposal is for:

```
AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog, const char* db_schema, const char* table_name, bool approximate, struct ArrowArrayStream* out);

- Parameters allow filtering down to an individual table, or you can request data for multiple tables at once
- "approximate" is an enum allowing you to request exact statistics, or just get approximate/best-effort/out of date statistics
```

The result set has schema:

```
- catalog: str
- db_schema: str
- table_name: str
- statistic_type: str (one of null percentage, row count, ndv, byte_width or a database-specific value)
- column_name: str (null if table-wide statistics)
- value: double
- is_approximate: bool
```

- column_name is null if the statistic applies to the whole table
- null_percentage is a value in [0, 1] representing the fraction of rows in the column that are null
- ndv is the number of distinct values in the column (I'm tempted to take the PostgreSQL definition: positive means a fixed number of distinct values, negative means a percentage of distinct values)
- row count is a value in [0, inf)
- byte_width is a value in [0, inf) representing the average size in bytes of a row in the column (e.g. for a string column, this would be the average string size)

Unknown values should be null, or the whole row should simply be omitted.

Questions:

- Do we care about min, max, etc.? IMO no, this complicates the encoding of the 'value' in the result, and utility is questionable. (But maybe we do want `value` to be at least `union[double, string]` to perhaps allow for this?)
- Do we encode the statistic names as strings, require dictionary encoding, or specify an enumeration? (I would prefer dictionary encoding, but this complicates implementation a bit.
The benefit is that if we specify some fixed dictionary values, we can save space on the common values and avoid lots of string comparisons while still allowing self-describing extensibility by vendors)
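
To make the two less obvious parts of the proposal concrete, here is a minimal sketch in C of how a client might interpret rows of the result set. It is not part of ADBC: the `StatisticKind` enum, its numbering, the underscore spellings of the names, and both helper functions are hypothetical and exist only for illustration. The first helper maps a `statistic_type` string onto a small fixed set of well-known kinds while letting vendor-specific names pass through, which is the idea behind fixing some dictionary values. The second resolves `ndv` under the PostgreSQL-style convention mentioned above, assuming a negative value is the negated fraction of rows that are distinct.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ordinals for the well-known statistic names in the proposal.
   The names and numbering are illustrative assumptions, not an ADBC API. */
enum StatisticKind {
  STAT_VENDOR_SPECIFIC = -1, /* anything not in the fixed list below */
  STAT_NULL_PERCENTAGE = 0,
  STAT_ROW_COUNT = 1,
  STAT_NDV = 2,
  STAT_BYTE_WIDTH = 3
};

/* Fixed "dictionary" of well-known names; the index matches the enum value. */
static const char* const kWellKnownNames[] = {
  "null_percentage", "row_count", "ndv", "byte_width"
};

/* Map a statistic_type string to a well-known kind.
   Unknown (database-specific) or NULL names yield STAT_VENDOR_SPECIFIC. */
enum StatisticKind statistic_kind_from_name(const char* name) {
  size_t count = sizeof(kWellKnownNames) / sizeof(kWellKnownNames[0]);
  if (name == NULL) return STAT_VENDOR_SPECIFIC;
  for (size_t i = 0; i < count; i++) {
    if (strcmp(name, kWellKnownNames[i]) == 0) return (enum StatisticKind)i;
  }
  return STAT_VENDOR_SPECIFIC;
}

/* Resolve an ndv value to an absolute distinct-value estimate.
   Assumed convention: ndv >= 0 is already a count; ndv < 0 is the negated
   fraction of rows that are distinct, so it is scaled by the row count. */
double resolve_ndv(double ndv, double row_count) {
  return ndv >= 0.0 ? ndv : -ndv * row_count;
}
```

For example, under these assumptions `resolve_ndv(-0.5, 1000.0)` gives 500 distinct values, while `resolve_ndv(42.0, 1000.0)` stays at 42. With dictionary encoding, the string comparison in `statistic_kind_from_name` would only need to run once per dictionary entry rather than once per row.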
