alamb opened a new issue, #6164:
URL: https://github.com/apache/arrow-rs/issues/6164

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   @efredine  recently added support for extracting statistics from parquet 
files as arrays in https://github.com/apache/arrow-rs/pull/6046 using 
`StatisticsConverter`
   
   During development we have also added support for `StringViewArray` and 
`BinaryViewArray` in https://github.com/apache/arrow-rs/issues/5374
   
   Currently there is no way to read `StringViewArray` and `BinaryViewArray` 
statistics, and trying to read data page level statistics for these types 
panics, as I found in https://github.com/apache/datafusion/pull/11723
   
   ```
   not implemented
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   External error: query failed: DataFusion error: Join Error
   caused by
   ```
   
   **Describe the solution you'd like**
   1. Implement the ability to extract parquet statistics as `StringView` and 
`BinaryView`
   2. Remove the panic caused by `unimplemented!` at 
https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L946
   
   The code is in 
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_reader/statistics.rs
   
   
   
   **Describe alternatives you've considered**
   
   You can avoid the panic by following the model of this: 
https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L465-L467
   
   Then, you can probably write a test following the model of utf8 and binary
   
   
https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L1897-L1917
   
   
https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L1956-L1984
   
   And then implement the missing pieces of code (use `StringViewBuilder` / 
`BinaryViewBuilder` instead of `StringBuilder` / `BinaryBuilder`) 
   
   
   I have a hacky version in https://github.com/apache/datafusion/pull/11753 
that looks something like
   
   ```rust
               DataType::Utf8View => {
                   let iterator = [<$stat_type_prefix ByteArrayStatsIterator>]::new($iterator);
                   let mut builder = StringViewBuilder::new();
                   for x in iterator {
                       let Some(x) = x else {
                           builder.append_null(); // no statistics value
                           continue;
                       };
   
                       let Ok(x) = std::str::from_utf8(x) else {
                           log::debug!("Utf8 statistics is a non-UTF8 value, ignoring it.");
                           builder.append_null();
                           continue;
                       };
   
                       builder.append_value(x);
                   }
                   Ok(Arc::new(builder.finish()))
               },
   ```
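
   The null-handling rule in the snippet above (a missing statistic or a 
non-UTF-8 statistic both become null) can be checked in isolation. This is 
only a std-only sketch: `utf8_stats_to_options` is a hypothetical name, and 
`Vec<Option<&str>>` stands in for `StringViewBuilder` so that it runs without 
the arrow crates. It is not the code from either PR.

   ```rust
   /// Hypothetical helper: mirrors the builder loop above, with
   /// `Vec<Option<&str>>` standing in for `StringViewBuilder`.
   fn utf8_stats_to_options<'a, I>(stats: I) -> Vec<Option<&'a str>>
   where
       I: IntoIterator<Item = Option<&'a [u8]>>,
   {
       stats
           .into_iter()
           // `None` (no statistics value) stays `None`; bytes that are not
           // valid UTF-8 are also mapped to `None` instead of panicking.
           .map(|stat| stat.and_then(|bytes| std::str::from_utf8(bytes).ok()))
           .collect()
   }

   fn main() {
       let stats = vec![
           Some(&b"apple"[..]),    // valid UTF-8 statistic
           None,                   // statistic missing
           Some(&b"\xff\xfe"[..]), // not valid UTF-8
       ];
       let values = utf8_stats_to_options(stats);
       assert_eq!(values, vec![Some("apple"), None, None]);
   }
   ```

   For `BinaryView` the UTF-8 check would presumably not be needed, since any 
byte sequence is a valid binary value.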
   
   **Additional context**
   

