Re: [I] [Format] Passing column statistics through Arrow C data interface [arrow]

via GitHub Mon, 13 May 2024 15:12:53 -0700


kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2108891730


   Here is the latest idea. Feedback is welcome.
   
   It's based on 
https://github.com/apache/arrow/issues/38837#issuecomment-2096990408 and 
https://github.com/apache/arrow/issues/38837#issuecomment-2097062873 .
   
   If we have a record batch that has `int32 column1` and `string column2`, we 
have the following `ArrowSchema`. Note that `metadata` has `"ARROW:statistics" 
=> ArrowArray*`.
   
   ```text
   ArrowSchema {
     .format = "+s",
     .metadata = {
       "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
     },
     .children = {
       ArrowSchema {
         .name = "column1",
         .format = "i",
         .metadata = {
           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
         },
       },
       ArrowSchema {
         .name = "column2",
         .format = "u",
         .metadata = {
           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
         },
       },
     },
   }
   ```
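
As a rough illustration of how the `"ARROW:statistics"` entry could be carried, the sketch below encodes one key/value pair in the binary layout that the C data interface uses for `ArrowSchema.metadata` (an int32 pair count, then a length-prefixed key and a length-prefixed value, all in native byte order). Storing the raw bytes of the `ArrowArray*` pointer as the value is an assumption made for this example; the idea above does not specify the encoding. The function name `encode_statistics_metadata` and the helper `put_bytes` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Opaque here; the real definition comes from the Arrow C data interface. */
struct ArrowArray;

/* Copy `n` raw bytes to `*p` and advance the cursor. */
static void put_bytes(char **p, const void *src, size_t n) {
  memcpy(*p, src, n);
  *p += n;
}

/* Build a metadata buffer holding the single pair
   "ARROW:statistics" => <bytes of the ArrowArray pointer>.
   ASSUMPTION: the pointer is stored as raw native pointer bytes. */
char *encode_statistics_metadata(const struct ArrowArray *stats,
                                 size_t *out_size) {
  static const char key[] = "ARROW:statistics";
  const int32_t n_pairs = 1;
  const int32_t key_len = (int32_t)(sizeof(key) - 1); /* no trailing NUL */
  const int32_t value_len = (int32_t)sizeof(stats);

  assert(stats != NULL && out_size != NULL);
  *out_size = 3 * sizeof(int32_t) + (size_t)key_len + (size_t)value_len;
  char *buf = malloc(*out_size);
  if (buf == NULL) return NULL;

  char *p = buf;
  put_bytes(&p, &n_pairs, sizeof n_pairs);
  put_bytes(&p, &key_len, sizeof key_len);
  put_bytes(&p, key, (size_t)key_len);
  put_bytes(&p, &value_len, sizeof value_len);
  put_bytes(&p, &stats, sizeof stats);
  return buf; /* the caller owns the buffer and must free() it */
}
```

A consumer would walk the same layout in reverse, compare the key, and copy the value bytes back into an `ArrowArray*`. Passing a pointer this way only makes sense within one process, which matches the scope of the C data interface.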
   
   `ArrowArray*` for statistics uses the following schema:
   
   | Field Name     | Field Type                       | Comments |
   |----------------|----------------------------------| -------- |
   | key            | int16 not null                   | (1)      |
   | value          | `VALUE_SCHEMA` not null            |          |
   | is_approximate | bool not null                    | (2)      |
   
   1. A dictionary-encoded statistic name (although we do not use the Arrow
      dictionary type). Values in [0, 1024) are reserved for Apache Arrow. The
      values should be aligned with ADBC. Other values are for
      implementation-specific statistics. For the definitions of predefined
      statistic types, see
      [adbc-table-statistics](https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L524-L570).

      TODO: Should we provide a feature to get driver-specific statistic names,
      as ADBC does with `AdbcConnectionGetStatisticNames()`? Or should we use
      `string` instead of `int16`?
   2. If true, then the value is approximate or best-effort.
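
To make the key ranges in note 1 concrete, here is a minimal sketch of how a consumer might classify an `int16` key. The macro and function names are hypothetical, and treating negative keys as neither reserved nor implementation-specific is an assumption; the idea above only states that [0, 1024) is reserved for Apache Arrow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Keys in [0, 1024) are reserved for Apache Arrow (hypothetical macro name). */
#define ARROW_STATISTIC_RESERVED_KEY_END 1024

/* The reserved range must be representable in the int16 key column. */
static_assert(ARROW_STATISTIC_RESERVED_KEY_END <= INT16_MAX,
              "reserved key range must fit in int16");

/* True when `key` falls in the range reserved for Apache Arrow. */
bool statistic_key_is_reserved(int16_t key) {
  return key >= 0 && key < ARROW_STATISTIC_RESERVED_KEY_END;
}

/* True for keys that a producer may define for itself.
   ASSUMPTION: negative keys are treated as invalid. */
bool statistic_key_is_implementation_specific(int16_t key) {
  return key >= ARROW_STATISTIC_RESERVED_KEY_END;
}
```

With `int16` keys this leaves 1024 through 32767 for implementation-specific statistics, which is one reason the TODO about exposing driver-specific names (or switching to `string`) matters.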
   
   `VALUE_SCHEMA` is a dense union with members:
   
   | Field Name               | Field Type                       | Comments |
   |--------------------------|----------------------------------| -------- |
   | int64                    | int64                            |          |
   | uint64                   | uint64                           |          |
   | float64                  | float64                          |          |
   | value                    | The same type as the `ArrowSchema` it belongs to. | (3)      |
   
   3. If the `ArrowSchema`'s type is `string`, this type is also `string`.
   
      TODO: Is `value` a good name?
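
For readers less familiar with dense unions, the sketch below shows how a consumer might read one row of `VALUE_SCHEMA`: each row has a type id selecting the member and an offset into that member's values. The type id assignments, the `StatValues` struct, and `stat_value_as_double` are all hypothetical and simplified (no validity handling), and the column-typed `value` member is left out because its layout depends on the column's type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type ids for the four union members, in table order. */
enum { STAT_INT64 = 0, STAT_UINT64 = 1, STAT_FLOAT64 = 2, STAT_VALUE = 3 };

/* Simplified view of a dense union: one type id and one offset per row,
   plus one values buffer per fixed-width member. */
struct StatValues {
  const int8_t *type_ids;
  const int32_t *offsets;
  const int64_t *int64_values;
  const uint64_t *uint64_values;
  const double *float64_values;
};

/* Read row `i` as a double (for display only; large integers lose precision).
   Returns 0 on success, -1 when the row holds the column-typed member
   or an unknown type id. */
int stat_value_as_double(const struct StatValues *v, size_t i, double *out) {
  assert(v != NULL && out != NULL);
  const int32_t off = v->offsets[i];
  switch (v->type_ids[i]) {
    case STAT_INT64:
      *out = (double)v->int64_values[off];
      return 0;
    case STAT_UINT64:
      *out = (double)v->uint64_values[off];
      return 0;
    case STAT_FLOAT64:
      *out = v->float64_values[off];
      return 0;
    default:
      return -1;
  }
}
```

The extra indirection through `offsets` is what distinguishes the dense union from the flattened alternative, where each row would instead carry one slot per candidate type.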
   
   TODO: Should we embed `VALUE_SCHEMA` in the statistics schema, something
   like the following?
   
   | Field Name     | Field Type                       | Comments |
   |----------------|----------------------------------| -------- |
   | key            | int16 not null                   | (1)      |
   | value_int64    | int64                            | (4)      |
   | value_uint64   | uint64                           | (4)      |
   | value_float64  | float64                          | (4)      |
   | value          | The same type as the `ArrowSchema` it belongs to. | (3) (4)  |
   | is_approximate | bool not null                    | (2)      |
   
   4. Exactly one of them is non-null in each row.


