alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2099247223
> Just wondering if there's anything left to do to address this issue please? If so, I'm happy to pick this up if that's ok.
That would be amazing -- thank you very much @opensourcegeek
What I think would be ideal is an API in `parquet::arrow` that looks like
this:
```rust
/// statistics extracted from `Statistics` as Arrow `ArrayRef`s
///
/// # Note:
/// If the corresponding `Statistics` is not present, or has no information for
/// a column, a NULL is present in the corresponding array entry
pub struct ArrowStatistics {
    /// min values
    min: ArrayRef,
    /// max values
    max: ArrayRef,
    /// Row counts (UInt64Array)
    row_count: ArrayRef,
    /// Null Counts (UInt64Array)
    null_count: ArrayRef,
}
// (TODO accessors for min/max/row_count/null_count)
/// Extract `ArrowStatistics` from the parquet [`Statistics`]
pub fn parquet_stats_to_arrow<'a>(
    arrow_datatype: &DataType,
    statistics: impl IntoIterator<Item = Option<&'a Statistics>>,
) -> Result<ArrowStatistics> {
    todo!()
}
```
(This is similar to the existing API
[parquet](https://docs.rs/parquet/latest/parquet/index.html)::[arrow](https://docs.rs/parquet/latest/parquet/arrow/index.html)::[parquet_to_arrow_schema](https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_to_arrow_schema.html#))
Note that `Statistics` here refers to this type:
[`Statistics`](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html)
There is a version of this code in DataFusion that could perhaps be
adapted:
https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L186
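To make the "NULL when statistics are missing" behavior concrete, here is a rough, std-only sketch. It is not the proposed implementation: `ColumnStats`, `StatsColumns` and `stats_to_columns` are hypothetical stand-ins I made up for illustration, plain `Vec<Option<_>>` plays the role of the Arrow arrays, and only a single Int64 column is modeled (row counts are left out).

```rust
/// Hypothetical stand-in for the parquet statistics of one Int64 column
/// in one row group (NOT the real `parquet::file::statistics::Statistics`).
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<u64>,
}

/// Hypothetical stand-in for `ArrowStatistics`: one entry per row group,
/// where `None` plays the role of an Arrow NULL.
#[derive(Debug, Default, PartialEq)]
struct StatsColumns {
    min: Vec<Option<i64>>,
    max: Vec<Option<i64>>,
    null_count: Vec<Option<u64>>,
}

/// Turn per-row-group statistics into columns, one entry per input item.
fn stats_to_columns<'a>(
    statistics: impl IntoIterator<Item = Option<&'a ColumnStats>>,
) -> StatsColumns {
    let mut out = StatsColumns::default();
    for stats in statistics {
        // Missing statistics, or a missing individual value, become NULL.
        out.min.push(stats.and_then(|s| s.min));
        out.max.push(stats.and_then(|s| s.max));
        out.null_count.push(stats.and_then(|s| s.null_count));
    }
    out
}

fn main() {
    let rg0 = ColumnStats { min: Some(1), max: Some(10), null_count: Some(0) };
    // This row group has a max and a null count, but no min.
    let rg2 = ColumnStats { min: None, max: Some(7), null_count: Some(2) };

    // The middle row group has no statistics at all.
    let cols = stats_to_columns([Some(&rg0), None, Some(&rg2)]);
    println!("{:?}", cols);
}
```

The point of the sketch is that the output always has exactly one entry per input row group, so callers can line the statistics up with row groups by index even when some of them have no statistics.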
## Testing
I suggest you add a new top-level test binary in
https://github.com/apache/arrow-rs/tree/master/parquet/tests called
`statistics.rs`
The tests should look like:
```
let record_batch = make_batch_with_relevant_datatype();
// write batch/batches to file
// open file / extract stats from metadata
// compare stats
```
I can help write these tests.
I personally suggest:
1. Make a PR with the basic API and a few basic types (like Int/UInt and
maybe String) and figure out the test pattern (I can definitely help here)
2. Then we can fill out support for the rest of the types in a follow-on PR
cc @tustvold in case you have other ideas