kou commented on issue #38837: URL: https://github.com/apache/arrow/issues/38837#issuecomment-2088101530
Some approaches that are based on the C Data interface https://arrow.apache.org/docs/format/CDataInterface.html :

### (1) Add `get_statistics` callback to `ArrowArray`

For example:

```c
#include <stdint.h>

struct ArrowArray {
  // Array data description
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;

  // Callback to return statistics of this ArrowArray
  struct ArrowArray *(*get_statistics)(struct ArrowArray*);

  // Release callback
  void (*release)(struct ArrowArray*);
  // Opaque producer-specific data
  void* private_data;
};
```

This uses a `struct ArrowArray` to represent statistics, as in https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 , but we can define `struct ArrowStatistics` or something similar instead.

Note that this is a backward incompatible change. `struct ArrowArray` has neither version information nor spare room for extension, so adding a member changes the layout that existing producers and consumers were compiled against. We can't do this without breaking backward compatibility.

### (2) Add statistics to `ArrowSchema::metadata`

https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata

If we choose this approach, we will reserve some metadata keys such as `ARROW:XXX`, like we did for the IPC format: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

Here are some ideas for how to put statistics into `ArrowSchema::metadata`:

1. Use `struct ArrowArray*` (pointer) as the `ARROW:statistics` metadata value
2. Use multiple metadata entries to represent statistics

Here is an example of approach 2:

```json
{
  "ARROW:statistics:column1:max": 2.9,
  "ARROW:statistics:column1:max:approximate": true,
  "ARROW:statistics:column2:average_byte_width": 29.9
}
```

TODO:

* How to encode each value (`2.9`, `true` and `29.9`) to raw byte data? We can use only raw byte data for a value of `ArrowSchema::metadata`.
* Can we support columns that have the same name with this approach?
* This isn't space efficient because we have many duplicated texts such as `ARROW:statistics:`.

Note that this is a (roughly) backward compatible change. I think that most users don't use `ARROW:XXX` as a metadata key.

This may not work with the C stream interface https://arrow.apache.org/docs/format/CStreamInterface.html , because a stream shares one `struct ArrowSchema` among multiple `struct ArrowArray`s, and each `struct ArrowArray` will have different statistics.
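To make the first TODO item concrete, here is a minimal sketch of how the three entries from the JSON example could be packed into the binary layout that the C Data interface specifies for `ArrowSchema::metadata`: an `int32` pair count, then for each pair an `int32` key length, the key bytes, an `int32` value length and the value bytes, all in native endianness. The value encoding is only an assumption for illustration (a raw native-endian `double` for numbers, a single byte for booleans); nothing in this discussion has settled on it. `build_statistics_metadata`, `put_i32`, `put_pair` and `METADATA_CAPACITY` are hypothetical names, not part of Arrow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Fixed capacity that is large enough for the three example entries. */
enum { METADATA_CAPACITY = 256 };

/* Append a native-endian int32 to buf at *pos. */
static void put_i32(char* buf, size_t* pos, int32_t v) {
  assert(*pos + sizeof v <= METADATA_CAPACITY);
  memcpy(buf + *pos, &v, sizeof v);
  *pos += sizeof v;
}

/* Append one pair: int32 key length, key bytes (no terminating NUL),
 * int32 value length, value bytes. */
static void put_pair(char* buf, size_t* pos, const char* key,
                     const void* value, int32_t value_len) {
  int32_t key_len = (int32_t)strlen(key);
  put_i32(buf, pos, key_len);
  assert(*pos + (size_t)key_len <= METADATA_CAPACITY);
  memcpy(buf + *pos, key, (size_t)key_len);
  *pos += (size_t)key_len;
  put_i32(buf, pos, value_len);
  assert(*pos + (size_t)value_len <= METADATA_CAPACITY);
  memcpy(buf + *pos, value, (size_t)value_len);
  *pos += (size_t)value_len;
}

/* Build a metadata buffer that holds the three statistics from the JSON
 * example. ASSUMPTION: numbers are stored as raw native-endian doubles and
 * booleans as one byte (0 or 1). Returns NULL on allocation failure; the
 * caller frees the result. */
char* build_statistics_metadata(size_t* out_size) {
  const double max = 2.9;
  const uint8_t approximate = 1;
  const double average_byte_width = 29.9;
  size_t pos = 0;
  char* buf = malloc(METADATA_CAPACITY);
  if (buf == NULL) return NULL;

  put_i32(buf, &pos, 3); /* number of key/value pairs */
  put_pair(buf, &pos, "ARROW:statistics:column1:max",
           &max, (int32_t)sizeof max);
  put_pair(buf, &pos, "ARROW:statistics:column1:max:approximate",
           &approximate, (int32_t)sizeof approximate);
  put_pair(buf, &pos, "ARROW:statistics:column2:average_byte_width",
           &average_byte_width, (int32_t)sizeof average_byte_width);

  *out_size = pos;
  return buf;
}
```

A producer would point `ArrowSchema::metadata` at the returned buffer and free it in its `release` callback. With raw bytes like these, the consumer can't tell whether a value is a `double`, an integer or a boolean unless the key name implies the type, so the encoding question in the TODO stays open. The sketch also shows the space concern: the `ARROW:statistics:` prefix is repeated in every key.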
