Re: [I] Add function that converts from parquet statistics `ParquetStatistics` to arrow arrays `ArrayRef` [arrow-rs]

via GitHub Tue, 07 May 2024 13:26:19 -0700


alamb commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2099247223


   > Just wondering if there's anything left to do to address this issue please? If so, I'm happy to pick this up if that's ok.
   
   That would be amazing -- thank you very much @opensourcegeek 
   
   What I think would be ideal is an API in `parquet::arrow` that looks like this:
   
   ```rust
   /// statistics extracted from `Statistics` as Arrow `ArrayRef`s
   ///
   /// # Note:
   /// If the corresponding `Statistics` is not present, or has no information for
   /// a column, a NULL is present in the corresponding array entry
   pub struct ArrowStatistics {
     /// min values
     min: ArrayRef,
     /// max values
     max: ArrayRef,
     /// Row counts (UInt64Array)
     row_count: ArrayRef,
     /// Null Counts (UInt64Array)
     null_count: ArrayRef,
   }
   
   // (TODO accessors for min/max/row_count/null_count)
   
   /// Extract `ArrowStatistics` from the  parquet [`Statistics`]
   pub fn parquet_stats_to_arrow(
       arrow_datatype: &DataType,
       statistics: impl IntoIterator<Item = Option<&Statistics>>
   ) -> Result<ArrowStatistics> {
     todo!()
   }
   ```
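
To illustrate the NULL-for-missing-statistics behaviour described in the doc comment, here is a minimal, std-only sketch. It is not the proposed implementation: `ColumnStats`, `Int32ColumnSummary`, and `summarize_int32` are hypothetical stand-ins (for `Statistics`, `ArrowStatistics`, and `parquet_stats_to_arrow` respectively), it handles only an `i32` column, and a `Vec<Option<T>>` plays the role of a nullable Arrow array.

```rust
/// Hypothetical, simplified stand-in for parquet's `Statistics`
/// (the real enum covers every physical type and has more fields).
#[derive(Debug, Clone, PartialEq)]
enum ColumnStats {
    Int32 {
        min: Option<i32>,
        max: Option<i32>,
        null_count: Option<u64>,
    },
}

/// Hypothetical stand-in for `ArrowStatistics`.
/// Each `Vec<Option<T>>` models a nullable Arrow array with one entry
/// per input statistics value (for example, one per row group).
#[derive(Debug, Default, PartialEq)]
struct Int32ColumnSummary {
    min: Vec<Option<i32>>,
    max: Vec<Option<i32>>,
    null_count: Vec<Option<u64>>,
}

/// Collect per-row-group statistics into column-shaped vectors.
/// A missing statistics entry (`None`) becomes a null in every output vector.
fn summarize_int32<'a>(
    stats: impl IntoIterator<Item = Option<&'a ColumnStats>>,
) -> Int32ColumnSummary {
    let mut out = Int32ColumnSummary::default();
    for entry in stats {
        match entry {
            Some(ColumnStats::Int32 { min, max, null_count }) => {
                out.min.push(*min);
                out.max.push(*max);
                out.null_count.push(*null_count);
            }
            None => {
                // No statistics were recorded: emit a null in each output.
                out.min.push(None);
                out.max.push(None);
                out.null_count.push(None);
            }
        }
    }
    out
}

fn main() {
    let first = ColumnStats::Int32 { min: Some(1), max: Some(10), null_count: Some(0) };
    let third = ColumnStats::Int32 { min: Some(-5), max: Some(3), null_count: Some(2) };

    // The second row group has no statistics at all.
    let summary = summarize_int32([Some(&first), None, Some(&third)]);

    assert_eq!(summary.min, vec![Some(1), None, Some(-5)]);
    println!("{:?}", summary);
}
```

The output vectors always have the same length as the input, so entry `i` lines up with row group `i` even when its statistics are absent.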
   
   (This is similar to the existing API 
[parquet](https://docs.rs/parquet/latest/parquet/index.html)::[arrow](https://docs.rs/parquet/latest/parquet/arrow/index.html)::[parquet_to_arrow_schema](https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_to_arrow_schema.html#))
   
   Note that `Statistics` here means [`parquet::file::statistics::Statistics`](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html).
   
   There is a version of this code in DataFusion that could perhaps be adapted: 
https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L186
   
   ## Testing
   I suggest you add a new top level test binary in 
https://github.com/apache/arrow-rs/tree/master/parquet/tests called 
`statistics.rs`
   
   The tests should look like:
   ```
   let record_batch = make_batch_with_relevant_datatype();
   // write batch/batches to file
   // open file / extract stats from metadata
   // compare stats
   ```
   
   I can help write these tests.
   
   I personally suggest:
   1. Make a PR with the basic API and a few basic types (like Int/UInt and maybe String) and figure out the test pattern (I can definitely help here)
   2. Then we can fill out support for the rest of the types in a follow-on PR
   
   cc @tustvold  in case you have other ideas


