kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

   Updated version. Feedback is welcome. I'll also share this idea on the 
`[email protected]` mailing list.
   
   It's based on 
https://github.com/apache/arrow/issues/38837#issuecomment-2108891730 .
   
   If we have a record batch with `int32 column1` and `string column2`, we 
get the following `ArrowSchema`. Note that `metadata` has `"ARROW:statistics" 
=> ArrowArray*`. Here `ArrowArray*` is the address of an `ArrowArray` written 
as a base 10 string, because metadata values can only be strings. Consumers 
must not release the statistics `ArrowArray*` (its `release` is `NULL`). This 
follows the 
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation 
semantics: the base `ArrowSchema` owns the statistics `ArrowArray*`.
   
   ```text
   ArrowSchema {
     .format = "+s",
     .metadata = {
       "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
     },
     .children = {
       ArrowSchema {
         .name = "column1",
         .format = "i",
         .metadata = {
           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
         },
       },
       ArrowSchema {
         .name = "column2",
         .format = "u",
         .metadata = {
           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
         },
       },
     },
   }
   ```
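
   As a rough illustration of the "base 10 string of the address" idea, the 
following C sketch converts a pointer to such a string and back. The helper 
names `statistics_address_to_string` and `statistics_address_from_string` are 
hypothetical and not part of Arrow. It also assumes the metadata value has 
already been copied into a NUL-terminated buffer, since C data interface 
metadata values are length-prefixed rather than NUL-terminated.

   ```c
   #include <assert.h>
   #include <errno.h>
   #include <inttypes.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Opaque here; the real definition comes from the Arrow C data interface. */
   struct ArrowArray;

   /* Hypothetical helper: write the address of `array` as a base 10 string.
    * Returns the number of characters written (without the terminator),
    * or -1 if `out` is too small. */
   static int statistics_address_to_string(const struct ArrowArray* array,
                                           char* out, size_t out_size) {
     assert(out != NULL);
     int n = snprintf(out, out_size, "%" PRIuPTR, (uintptr_t)array);
     return (n < 0 || (size_t)n >= out_size) ? -1 : n;
   }

   /* Hypothetical helper: parse a base 10 address string back into a pointer.
    * Returns NULL if the string is empty, contains non-digits, or overflows. */
   static struct ArrowArray* statistics_address_from_string(const char* value) {
     assert(value != NULL);
     /* strtoumax() accepts leading whitespace or a sign, so reject those first. */
     if (value[0] < '0' || value[0] > '9') return NULL;
     char* end = NULL;
     errno = 0;
     uintmax_t address = strtoumax(value, &end, 10);
     if (*end != '\0' || errno == ERANGE || address > UINTPTR_MAX) return NULL;
     return (struct ArrowArray*)(uintptr_t)address;
   }
   ```

   A consumer would only read through the returned pointer and would never 
call `release` on it, because the base `ArrowSchema` owns the statistics array.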
   
   `ArrowArray*` for statistics uses the following schema:
   
   | Field Name     | Field Type              | Comments |
   |----------------|-------------------------|----------|
   | key            | string not null         | (1)      |
   | value          | `VALUE_SCHEMA` not null |          |
   | is_approximate | bool not null           | (2)      |
   
   1. We'll provide pre-defined keys such as `max`, `min`, `byte_width` and 
`distinct_count`, but users can use application-specific keys too.
   2. If true, then the value is approximate or best-effort.
   
   `VALUE_SCHEMA` is a dense union with members:
   
   | Field Name | Field Type                                       | Comments |
   |------------|--------------------------------------------------|----------|
   | int64      | int64                                            |          |
   | uint64     | uint64                                           |          |
   | float64    | float64                                          |          |
   | value      | The same type as the `ArrowSchema` it belongs to | (3)      |
   
   3. If the `ArrowSchema`'s type is `string`, this type is also `string`.
   
      TODO: Is `value` a good name?
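
   To make the shape of `VALUE_SCHEMA` more concrete, here is a small C sketch 
of the C data interface format strings its members would use. The format 
strings themselves (`l`, `L`, `g`, `+ud:...`) come from the C data interface 
specification. The helper name, the constant names, and the choice of type IDs 
`0,1,2,3` are assumptions made only for this illustration.

   ```c
   #include <assert.h>
   #include <stddef.h>

   /* Number of members in the proposed VALUE_SCHEMA dense union. */
   #define STATISTICS_VALUE_N_MEMBERS 4

   /* Format string of the dense union itself.
    * The type IDs 0..3 are an assumption, not part of the proposal. */
   static const char* const kStatisticsValueFormat = "+ud:0,1,2,3";

   /* Hypothetical helper: fill `formats` with the format string of each union
    * member for a column whose own format string is `column_format`. */
   static void statistics_value_member_formats(
       const char* column_format,
       const char* formats[STATISTICS_VALUE_N_MEMBERS]) {
     assert(column_format != NULL);
     formats[0] = "l";           /* int64 */
     formats[1] = "L";           /* uint64 */
     formats[2] = "g";           /* float64 */
     formats[3] = column_format; /* value: same type as the column */
   }
   ```

   For `column2` above (format `u`), the last member would also be `u`, which 
matches note 3.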
   
   


