lidavidm commented on issue #685:
URL: https://github.com/apache/arrow-adbc/issues/685#issuecomment-1593400443

   So the proposal is for:
   
   ```
   AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog,
                               const char* db_schema, const char* table_name,
                               bool approximate, struct ArrowArrayStream* out);
   
   - Parameters allow filtering down to an individual table, or you can request
     data for multiple tables at once
   - "approximate" is a flag allowing you to request exact statistics, or just
     get approximate/best-effort/out-of-date statistics
   ```
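
To make the filtering idea concrete, here is a minimal sketch of how a driver might decide whether a table falls under the request. It assumes that passing NULL for a parameter means "don't filter on this field" and that matching is exact; the proposal doesn't spell out either, and `FieldMatches`/`TableMatches` are hypothetical helpers, not ADBC functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption (not stated in the proposal): a NULL filter matches anything,
 * and a non-NULL filter must match the name exactly. */
static bool FieldMatches(const char* filter, const char* value) {
  return filter == NULL || strcmp(filter, value) == 0;
}

/* Hypothetical helper: true if the table identified by the row_* names
 * passes the catalog/db_schema/table_name filters of the request. */
static bool TableMatches(const char* catalog, const char* db_schema,
                         const char* table_name, const char* row_catalog,
                         const char* row_db_schema,
                         const char* row_table_name) {
  return FieldMatches(catalog, row_catalog) &&
         FieldMatches(db_schema, row_db_schema) &&
         FieldMatches(table_name, row_table_name);
}
```

Under these assumptions, leaving all three filters NULL would request statistics for every table, while filling in all three narrows the result to a single table.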
   
   The result set has schema:
   
   ```
   - catalog: str
   - db_schema: str
   - table_name: str
   - statistic_type: str (one of null_percentage, row count, ndv, byte_width or
     a database-specific value)
   - column_name: str (null if table-wide statistics)
   - value: double
   - is_approximate: bool
   ```
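
For illustration, one row of this result set could be modeled on the consumer side roughly as below. This is only a sketch: `struct StatisticRow`, the `has_value` flag (standing in for a nullable `value`), and the two helpers are hypothetical names, not part of ADBC.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory view of one row of the proposed result set. */
struct StatisticRow {
  const char* catalog;
  const char* db_schema;
  const char* table_name;
  const char* statistic_type; /* a standard name or a database-specific one */
  const char* column_name;    /* NULL if the statistic is table-wide */
  double value;               /* only meaningful when has_value is true */
  bool has_value;             /* false models a null (unknown) value */
  bool is_approximate;
};

/* True if the statistic applies to the whole table, not a single column. */
static bool IsTableWide(const struct StatisticRow* row) {
  return row->column_name == NULL;
}

/* True if the driver actually reported a value for this statistic. */
static bool HasKnownValue(const struct StatisticRow* row) {
  return row->has_value;
}
```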
   
   - column_name is null if the statistic applies to the whole table
   - null_percentage is a value in [0, 1] representing the fraction of rows in
the column that are null
   - ndv is the number of distinct values in the column (I'm tempted to take
the PostgreSQL definition: positive means a fixed number of distinct values,
negative means the negated ratio of distinct values to rows, e.g. -1 if every
value is distinct)
   - row count is a value in [0, inf)
   - byte_width is a value in [0, inf) representing the average size in bytes
of a value in the column (e.g. for a string column, this would be the average
string size)
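
If ndv does follow the PostgreSQL convention, a consumer would need the row count to turn a negative value back into an absolute number. A small sketch of that arithmetic, along with the matching one for null_percentage (both helpers are hypothetical, not part of ADBC):

```c
/* Hypothetical helper: resolve an ndv statistic into an absolute count,
 * assuming the PostgreSQL n_distinct convention:
 *   ndv >= 0 : already the number of distinct values
 *   ndv <  0 : -(distinct values / row count), so -1.0 means all distinct */
static double ResolveDistinctCount(double ndv, double row_count) {
  if (ndv >= 0.0) {
    return ndv;
  }
  return -ndv * row_count;
}

/* Hypothetical helper: estimated number of null rows, treating
 * null_percentage as a fraction in [0, 1]. */
static double EstimateNullRows(double null_percentage, double row_count) {
  return null_percentage * row_count;
}
```

For a 1000-row table, an ndv of -0.5 would resolve to 500 distinct values, while an ndv of 42 is taken as-is.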
   
   Unknown values should be null, or the whole row should simply be omitted.
   
   
   Questions:
   - Do we care about min, max, etc.? IMO no, this complicates the encoding of 
the 'value' in the result, and utility is questionable. (But maybe we do want 
`value` to be at least `union[double, string]` to perhaps allow for this?)
   - Do we encode the statistic names as strings, require dictionary
encoding, or specify an enumeration? (I would prefer dictionary encoding,
but this complicates implementation a bit. The benefit is that if we specify
some fixed dictionary values, we can save space on the common values and avoid
lots of string comparisons while still allowing self-describing extensibility
by vendors.)
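
To sketch what "fixed dictionary values" could look like: reserve the first few dictionary slots for the standard statistics and let vendors append their own names after them. Everything here is an assumption for illustration, including the slot numbers, the exact spellings of the standard names, and the helper functions; none of it is specified by the proposal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reserved dictionary slots for the standard statistics. */
enum {
  kStatNullPercentage = 0,
  kStatRowCount = 1,
  kStatNdv = 2,
  kStatByteWidth = 3,
  kStatStandardCount = 4
};

/* Assumed spellings; the proposal does not fix the exact strings. */
static const char* const kStandardNames[kStatStandardCount] = {
    "null_percentage", "row_count", "ndv", "byte_width"};

/* Return the reserved slot for a standard name, or -1 if it is not one. */
static int StandardStatisticIndex(const char* name) {
  for (int i = 0; i < kStatStandardCount; i++) {
    if (strcmp(kStandardNames[i], name) == 0) {
      return i;
    }
  }
  return -1;
}

/* Decode a dictionary index: slots below kStatStandardCount are standard,
 * higher slots index into the vendor-supplied names. Returns NULL if the
 * index is out of range. */
static const char* StatisticName(int index, const char* const* vendor_names,
                                 size_t vendor_count) {
  if (index < 0) {
    return NULL;
  }
  if (index < kStatStandardCount) {
    return kStandardNames[index];
  }
  size_t vendor_index = (size_t)(index - kStatStandardCount);
  return vendor_index < vendor_count ? vendor_names[vendor_index] : NULL;
}
```

With this layout a consumer can switch on the integer index for the common statistics and only fall back to string handling for vendor-specific ones.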


