Re: [I] [Format] Passing column statistics through Arrow C data interface [arrow]

via GitHub Wed, 01 May 2024 00:41:08 -0700


kou commented on issue #38837:
URL: https://github.com/apache/arrow/issues/38837#issuecomment-2088101530


   Some approaches that are based on the C data interface 
https://arrow.apache.org/docs/format/CDataInterface.html :
   
   ### (1) Add `get_statistics` callback to `ArrowArray`
   
   For example:
   
   ```c
   struct ArrowArray {
     // Array data description
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
   
     // Callback to return statistics of this ArrowArray
     struct ArrowArray *(*get_statistics)(struct ArrowArray*);
     // Release callback
     void (*release)(struct ArrowArray*);
     // Opaque producer-specific data
     void* private_data;
   };
   ```
   
   This uses a `struct ArrowArray` to represent statistics, as suggested in 
https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 , but we 
could define a dedicated `struct ArrowStatistics` or something similar instead.
   
   Note that this is a backward incompatible change. `struct ArrowArray` has 
neither version information nor reserved space for extensions, so we can't add 
a member without breaking backward compatibility.
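   
   To make the compatibility problem concrete, here is a minimal sketch that 
compares the two layouts. `ArrowArrayV1` and `ArrowArrayV2` are names made up 
for this illustration: the first mirrors the current `struct ArrowArray`, the 
second mirrors the example above.
   
   ```c
   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>
   
   /* Current layout of struct ArrowArray (renamed for this sketch). */
   struct ArrowArrayV1 {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArrayV1** children;
     struct ArrowArrayV1* dictionary;
     void (*release)(struct ArrowArrayV1*);
     void* private_data;
   };
   
   /* Hypothetical layout from approach (1): get_statistics precedes release. */
   struct ArrowArrayV2 {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArrayV2** children;
     struct ArrowArrayV2* dictionary;
     struct ArrowArrayV2* (*get_statistics)(struct ArrowArrayV2*);
     void (*release)(struct ArrowArrayV2*);
     void* private_data;
   };
   
   /* The new callback occupies the slot where old consumers expect release. */
   static_assert(offsetof(struct ArrowArrayV2, get_statistics) ==
                     offsetof(struct ArrowArrayV1, release),
                 "get_statistics takes over the old release slot");
   
   /* Number of bytes by which the release member moves in the new layout. */
   size_t release_offset_shift(void) {
     return offsetof(struct ArrowArrayV2, release) -
            offsetof(struct ArrowArrayV1, release);
   }
   ```
   
   A consumer compiled against the current layout that calls 
`array->release(array)` on a struct filled in by a producer using the extended 
layout would read the `get_statistics` slot instead.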
   
   ### (2) Add statistics to `ArrowSchema::metadata`
   
   
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
   
   If we choose this approach, we will reserve some metadata keys such as 
`ARROW:XXX`, as we did for the IPC format: 
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
   
   Here are some ideas for how to put statistics into `ArrowSchema::metadata`:
   
   1. Use a `struct ArrowArray*` (pointer) as the `ARROW:statistics` metadata value
   2. Use multiple metadata entries to represent statistics
   
   Here is an example of the second approach:
   
   ```json
   {
     "ARROW:statistics:column1:max": 2.9,
     "ARROW:statistics:column1:max:approximate": true,
     "ARROW:statistics:column2:average_byte_width": 29.9
   }
   ```
   
   TODO:
   * How do we encode each value (`2.9`, `true` and `29.9`) as raw byte data? A 
value in `ArrowSchema::metadata` can only be raw byte data.
   * Can we support multiple columns with the same name with this approach?
   * This isn't space efficient because many keys repeat the same text, such as 
`ARROW:statistics:`.
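   
   For the first TODO item, here is a minimal sketch of one possible encoding, 
not a proposal. It follows the `ArrowSchema::metadata` layout from the C data 
interface specification (an int32 pair count, then a length-prefixed key and 
value for each pair, all in native byte order). Storing the value as the 8 raw 
bytes of a native-endian `double` is an assumption made for this illustration, 
and the helper names are hypothetical.
   
   ```c
   #include <assert.h>
   #include <stdint.h>
   #include <string.h>
   
   /* Writes one int32 in native byte order; returns the bytes written. */
   size_t put_int32(char* out, int32_t v) {
     memcpy(out, &v, sizeof v);
     return sizeof v;
   }
   
   /* Writes an int32 length followed by n raw bytes (no null terminator). */
   size_t put_bytes(char* out, const void* data, int32_t n) {
     size_t pos = put_int32(out, n);
     memcpy(out + pos, data, (size_t)n);
     return pos + (size_t)n;
   }
   
   /* Encodes a metadata buffer holding a single pair: key -> raw double.
    * ASSUMPTION: the value encoding (native-endian double) is illustrative.
    * Returns the encoded size, or 0 if the buffer is too small. */
   size_t encode_double_stat(char* out, size_t cap, const char* key,
                             double value) {
     size_t key_len = strlen(key);
     size_t need = 3 * sizeof(int32_t) + key_len + sizeof value;
     if (need > cap) return 0;
     size_t pos = put_int32(out, 1); /* number of key/value pairs */
     pos += put_bytes(out + pos, key, (int32_t)key_len);
     pos += put_bytes(out + pos, &value, (int32_t)sizeof value);
     return pos;
   }
   
   /* Reads the double back from a buffer written by encode_double_stat. */
   double decode_double_stat(const char* buf) {
     int32_t n_pairs, key_len;
     double value;
     memcpy(&n_pairs, buf, sizeof n_pairs);
     assert(n_pairs == 1);
     memcpy(&key_len, buf + sizeof(int32_t), sizeof key_len);
     memcpy(&value, buf + 3 * sizeof(int32_t) + (size_t)key_len, sizeof value);
     return value;
   }
   ```
   
   With this encoding, a consumer still has to know from the key alone that 
`ARROW:statistics:column1:max` holds a `double`, which is part of what the TODO 
item asks.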
   
   Note that this is a (roughly) backward compatible change. I think that most 
users don't use `ARROW:XXX` as a metadata key.
   
   This may not work with the C stream interface 
https://arrow.apache.org/docs/format/CStreamInterface.html , because a stream 
shares one `struct ArrowSchema` among multiple `struct ArrowArray`s, and each 
`struct ArrowArray` will have different statistics.


