Re: [I] [Format] Passing column statistics through Arrow C data interface [arrow]

via GitHub Mon, 06 May 2024 14:55:36 -0700


kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2096990408


   Thanks for sharing more information.
   
   Here is my new idea:
   
   It's based on option "(2) Add statistics to `ArrowSchema::metadata`" in 
https://github.com/apache/arrow/issues/38837#issuecomment-2088101530 .
   
   That option puts all statistics in the `metadata` of the top-level 
`ArrowSchema`. But I noticed that we don't need to do that. We can put the 
statistics for each child (column) in that child's `ArrowSchema::metadata`.
   
   If we have a record batch that has `int32 column1` and `string column2`, we 
have the following `ArrowSchema`:
   
   ```text
   ArrowSchema {
     .format = "+s",
     .children = {
       ArrowSchema {
         .name = "column1",
         .format = "i",
       },
       ArrowSchema {
         .name = "column2",
         .format = "u",
       },
     },
   }
   ```
   
   We can put an `ArrowArray*` for statistics in each child 
`ArrowSchema::metadata` instead of putting all statistics in the top-level 
`ArrowSchema::metadata`:
   
   ```text
   ArrowSchema {
     .format = "+s",
     .metadata = {
       "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
     },
     .children = {
       ArrowSchema {
         .name = "column1",
         .format = "i",
         .metadata = {
           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
         },
       },
       ArrowSchema {
         .name = "column2",
         .format = "u",
         .metadata = {
           "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
         },
       },
     },
   }
   ```
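
For reference, `ArrowSchema::metadata` in the C data interface is not a map but a single byte buffer: a native-endian int32 pair count, then for each pair an int32 key length, the key bytes, an int32 value length, and the value bytes. The following is a minimal Python sketch of that layout. How the `ArrowArray*` would be serialized into the value is not settled in this comment, so packing the address as a native pointer-sized integer is only an assumption for illustration, and `encode_metadata` / `decode_metadata` are hypothetical helper names.

```python
import struct


def encode_metadata(pairs):
    """Encode (key, value) byte pairs in the C data interface metadata layout."""
    out = [struct.pack("=i", len(pairs))]  # int32: number of key/value pairs
    for key, value in pairs:
        out.append(struct.pack("=i", len(key)))    # int32: key byte length
        out.append(key)
        out.append(struct.pack("=i", len(value)))  # int32: value byte length
        out.append(value)
    return b"".join(out)


def decode_metadata(buf):
    """Decode a metadata buffer back into a list of (key, value) byte pairs."""
    (count,) = struct.unpack_from("=i", buf, 0)
    offset = 4
    pairs = []
    for _ in range(count):
        fields = []
        for _ in range(2):  # first the key, then the value
            (length,) = struct.unpack_from("=i", buf, offset)
            offset += 4
            fields.append(bytes(buf[offset:offset + length]))
            offset += length
        pairs.append((fields[0], fields[1]))
    return pairs


# Assumption: the statistics ArrowArray address is stored as raw pointer bytes.
fake_address = 0x1000  # placeholder value, not a real ArrowArray*
column_metadata = encode_metadata(
    [(b"ARROW:statistics", struct.pack("@P", fake_address))]
)
```

A consumer would look up the `ARROW:statistics` key in each child's decoded metadata and interpret the value according to whatever pointer encoding is eventually agreed on.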
   
   `ArrowArray*` for statistics can use a simpler schema than the one in 
https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 :
   
   | Field Name     | Field Type                       | Comments |
   |----------------|----------------------------------| -------- |
   | key            | int16 not null                   | (1)      |
   | value          | `VALUE_SCHEMA` not null          |          |
   | is_approximate | bool not null                    | (2)      |
   
   1. A dictionary-encoded statistic name (although we do not use the Arrow
      dictionary type). Values in [0, 1024) are reserved for ADBC. Other
      values are for implementation-specific statistics. For the definitions
      of predefined statistic types, see
      [adbc-table-statistics](https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L524-L570).
      To get driver-specific statistic names, use `AdbcConnectionGetStatisticNames()`.
   2. If true, then the value is approximate or best-effort.
   
   `VALUE_SCHEMA` is a dense union with members:
   
   | Field Name               | Field Type                       |
   |--------------------------|----------------------------------|
   | int64                    | int64                            |
   | uint64                   | uint64                           |
   | float64                  | float64                          |
   | binary                   | binary                           |
   
   


