alamb opened a new issue, #6164: URL: https://github.com/apache/arrow-rs/issues/6164
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

@efredine recently added support for extracting statistics from parquet files as arrays in https://github.com/apache/arrow-rs/pull/6046 using `StatisticsConverter`.

During development we have also added support for `StringViewArray` and `BinaryViewArray` in https://github.com/apache/arrow-rs/issues/5374

Currently there is no way to read `StringViewArray` and `BinaryViewArray` statistics, and it actually panics if you try to read data page level statistics, as I found on https://github.com/apache/datafusion/pull/11723

```
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
External error: query failed: DataFusion error: Join Error
caused by
```

**Describe the solution you'd like**

1. Implement the ability to extract parquet statistics as `StringView` and `BinaryView`
2. Remove the panic caused by `unimplemented!` at https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L946

The code is in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_reader/statistics.rs

**Describe alternatives you've considered**

You can avoid the panic by following the model of this: https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L465-L467

Then, you can probably write a test following the model of utf8 and binary:

https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L1897-L1917

https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_reader/statistics.rs#L1956-L1984

And then implement the missing pieces of code (use `StringViewBuilder` / `BinaryViewBuilder` instead of `StringBuilder` / `BinaryBuilder`).

I have a hacky version in
https://github.com/apache/datafusion/pull/11753 that looks something like

```rust
DataType::Utf8View => {
    let iterator = [<$stat_type_prefix ByteArrayStatsIterator>]::new($iterator);
    let mut builder = StringViewBuilder::new();
    for x in iterator {
        let Some(x) = x else {
            builder.append_null(); // no statistics value
            continue;
        };

        let Ok(x) = std::str::from_utf8(x) else {
            log::debug!("Utf8 statistics is a non-UTF8 value, ignoring it.");
            builder.append_null();
            continue;
        };

        builder.append_value(x);
    }
    Ok(Arc::new(builder.finish()))
},
```

**Additional context**
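For anyone picking this up, the per-value decision in the macro arm above can be sketched outside the macro. This is only an illustration under stated assumptions: `stats_to_utf8` is a hypothetical helper name, and a plain `Vec<Option<String>>` stands in for the array that `StringViewBuilder` would produce, so the sketch runs without the `arrow` crate. In the real code each `None` would correspond to `builder.append_null()` and each `Some` to `builder.append_value(x)`.

```rust
/// Hypothetical helper (not part of arrow-rs): converts raw min/max statistics
/// bytes into optional strings. `Vec<Option<String>>` is a stand-in for the
/// array a `StringViewBuilder` would build.
fn stats_to_utf8<'a, I>(stats: I) -> Vec<Option<String>>
where
    I: IntoIterator<Item = Option<&'a [u8]>>,
{
    stats
        .into_iter()
        .map(|value| {
            // No statistics value: keep it as null.
            let bytes = value?;
            // A non-UTF8 statistic is also mapped to null instead of an error.
            std::str::from_utf8(bytes).ok().map(str::to_owned)
        })
        .collect()
}

fn main() {
    // 0xff can never appear in valid UTF-8, so this value must become null.
    let invalid: [u8; 2] = [0xff, 0xfe];
    let raw = vec![Some("abc".as_bytes()), None, Some(&invalid[..])];

    let converted = stats_to_utf8(raw);
    assert_eq!(converted, vec![Some("abc".to_string()), None, None]);
}
```

Mapping a non-UTF8 statistic to null rather than returning an error mirrors the choice in the snippet above: the value is simply ignored.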
