alamb commented on code in PR #11295:
URL: https://github.com/apache/datafusion/pull/11295#discussion_r1667745320
##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -919,16 +915,28 @@ macro_rules! get_data_page_statistics {
})
},
Some(DataType::FixedSizeBinary(size)) => {
- Ok(Arc::new(
- FixedSizeBinaryArray::try_from_iter(
- [<$stat_type_prefix
FixedLenByteArrayDataPageStatsIterator>]::new($iterator)
- .flat_map(|x| x.into_iter())
- .filter_map(|x| x)
- ).unwrap_or_else(|e| {
- log::debug!("FixedSizeBinary statistics is
invalid: {}", e);
- FixedSizeBinaryArray::new(*size, vec![].into(),
None)
- })
- ))
+ let mut builder = FixedSizeBinaryBuilder::new(*size);
Review Comment:
this looks like a nice change to me
##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -823,11 +819,11 @@ macro_rules! get_data_page_statistics {
Float16Array::from_iter(
[<$stat_type_prefix
Float16DataPageStatsIterator>]::new($iterator)
.map(|x| {
- x.into_iter().filter_map(|x| {
- x.and_then(|x|
Some(from_bytes_to_f16(x.data())))
+ x.into_iter().map(|x| {
+ x.and_then(|x| from_bytes_to_f16(x.data()))
})
})
- .flatten()
+ .flatten().collect::<Vec<_>>()
Review Comment:
That is an excellent call. It looks to me like this change adds `collect` to
make this use consistent with the other arms. I tried removing `collect`
locally and indeed the tests still pass
I think `collect` is necessary in some cases to avoid a panic due to the
iterator was not sized
like when I removed it from the BooleanArray I got an error like this
```
Iterator must be sized
thread 'parquet::arrow_statistics::test_data_page_stats_with_all_null_page'
panicked at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-52.1.0/src/array/boolean_array.rs:407:33:
Iterator must be sized
stack backtrace:
```
##########
datafusion/core/tests/parquet/arrow_statistics.rs:
##########
@@ -503,6 +512,71 @@ async fn test_multiple_data_pages_nulls_and_negatives() {
.run()
}
+#[tokio::test]
+async fn test_data_page_stats_with_all_null_page() {
+ for data_type in &[
+ DataType::Boolean,
+ DataType::UInt64,
+ DataType::UInt32,
+ DataType::UInt16,
+ DataType::UInt8,
+ DataType::Int64,
+ DataType::Int32,
+ DataType::Int16,
+ DataType::Int8,
+ DataType::Float16,
+ DataType::Float32,
+ DataType::Float64,
+ DataType::Date32,
+ DataType::Date64,
+ DataType::Time32(TimeUnit::Millisecond),
+ DataType::Time32(TimeUnit::Second),
+ DataType::Time64(TimeUnit::Microsecond),
+ DataType::Time64(TimeUnit::Nanosecond),
+ DataType::Timestamp(TimeUnit::Second, None),
+ DataType::Timestamp(TimeUnit::Millisecond, None),
+ DataType::Timestamp(TimeUnit::Microsecond, None),
+ DataType::Timestamp(TimeUnit::Nanosecond, None),
+ DataType::Binary,
+ DataType::LargeBinary,
+ DataType::FixedSizeBinary(3),
+ DataType::Utf8,
+ DataType::LargeUtf8,
+ DataType::Dictionary(Box::new(DataType::Int32),
Box::new(DataType::Utf8)),
+ DataType::Decimal128(8, 2), // as INT32
+ DataType::Decimal128(10, 2), // as INT64
+ DataType::Decimal128(20, 2), // as FIXED_LEN_BYTE_ARRAY
+ DataType::Decimal256(8, 2), // as INT32
+ DataType::Decimal256(10, 2), // as INT64
+ DataType::Decimal256(20, 2), // as FIXED_LEN_BYTE_ARRAY
+ ] {
+ let batch =
+ RecordBatch::try_from_iter(vec![("col", new_null_array(data_type,
4))])
+ .expect("record batch creation");
+
+ let reader =
+ build_parquet_file(4, Some(EnabledStatistics::Page), Some(4),
vec![batch]);
+
+ let expected_data_type = match data_type {
+ DataType::Dictionary(_, value_type) => value_type.as_ref(),
+ _ => data_type,
+ };
+
+ // There is one data page with 4 nulls
+ // The statistics should be present but null
+ Test {
Review Comment:
I verified that this test covered the code changes by running the test
without the code changes and it failed as expected.
```text
thread 'parquet::arrow_statistics::test_data_page_stats_with_all_null_page'
panicked at datafusion/core/tests/parquet/arrow_statistics.rs:276:13:
assertion `left == right` failed: col: Mismatch with expected data page
minimums
left: PrimitiveArray<UInt64>
[
]
right: PrimitiveArray<UInt64>
[
null,
]
stack backtrace:
0: rust_begin_unwind
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
1: core::panicking::panic_fmt
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364:5
4: parquet_exec::parquet::arrow_statistics::Test::run_checks
at ./tests/parquet/arrow_statistics.rs:276:13
5: parquet_exec::parquet::arrow_statistics::Test::run
at ./tests/parquet/arrow_statistics.rs:229:9
6:
parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
at ./tests/parquet/arrow_statistics.rs:567:9
7: <core::pin::Pin<P> as core::future::future::Future>::poll
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
8: <core::pin::Pin<P> as core::future::future::Future>::poll
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
9:
tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:57
10: tokio::runtime::coop::with_budget
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
11: tokio::runtime::coop::budget
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
12:
tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:25
13: tokio::runtime::scheduler::current_thread::Context::enter
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:404:19
14:
tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:658:36
15:
tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:68
16: tokio::runtime::context::scoped::Scoped<T>::set
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/scoped.rs:40:9
17: tokio::runtime::context::set_scheduler::{{closure}}
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:26
18: std::thread::local::LocalKey<T>::try_with
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:286:12
19: std::thread::local::LocalKey<T>::with
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:262:9
20: tokio::runtime::context::set_scheduler
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:9
21: tokio::runtime::scheduler::current_thread::CoreGuard::enter
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:27
22: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:646:19
23:
tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:175:28
24: tokio::runtime::context::runtime::enter_runtime
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
25: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:167:9
26: tokio::runtime::runtime::Runtime::block_on
at
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/runtime.rs:347:47
27:
parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page
at ./tests/parquet/arrow_statistics.rs:517:5
28:
parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
at ./tests/parquet/arrow_statistics.rs:516:51
29: core::ops::function::FnOnce::call_once
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
30: core::ops::function::FnOnce::call_once
at
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose
backtrace.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]