alamb commented on code in PR #11295:
URL: https://github.com/apache/datafusion/pull/11295#discussion_r1667745320


##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -919,16 +915,28 @@ macro_rules! get_data_page_statistics {
                     })
                 },
                Some(DataType::FixedSizeBinary(size)) => {
-                    Ok(Arc::new(
-                        FixedSizeBinaryArray::try_from_iter(
-                            [<$stat_type_prefix 
FixedLenByteArrayDataPageStatsIterator>]::new($iterator)
-                            .flat_map(|x| x.into_iter())
-                            .filter_map(|x| x)
-                        ).unwrap_or_else(|e| {
-                            log::debug!("FixedSizeBinary statistics is 
invalid: {}", e);
-                            FixedSizeBinaryArray::new(*size, vec![].into(), 
None)
-                        })
-                    ))
+                    let mut builder = FixedSizeBinaryBuilder::new(*size);

Review Comment:
   this looks like a nice change to me



##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -823,11 +819,11 @@ macro_rules! get_data_page_statistics {
                     Float16Array::from_iter(
                         [<$stat_type_prefix 
Float16DataPageStatsIterator>]::new($iterator)
                             .map(|x| {
-                                x.into_iter().filter_map(|x| {
-                                    x.and_then(|x| 
Some(from_bytes_to_f16(x.data())))
+                                x.into_iter().map(|x| {
+                                    x.and_then(|x| from_bytes_to_f16(x.data()))
                                 })
                             })
-                            .flatten()
+                            .flatten().collect::<Vec<_>>()

Review Comment:
   That is an excellent call. It looks to me like this change adds `collect` to 
make this use consistent with the other arms. I tried removing `collect` 
locally and indeed the tests still pass
   
    I think `collect` is necessary in some cases to avoid a panic due to the 
iterator was not sized
   
   like when I removed it from the BooleanArray I got an error like this
   
   ```
   Iterator must be sized
   thread 'parquet::arrow_statistics::test_data_page_stats_with_all_null_page' 
panicked at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-52.1.0/src/array/boolean_array.rs:407:33:
   Iterator must be sized
   stack backtrace:
   ```



##########
datafusion/core/tests/parquet/arrow_statistics.rs:
##########
@@ -503,6 +512,71 @@ async fn test_multiple_data_pages_nulls_and_negatives() {
     .run()
 }
 
+#[tokio::test]
+async fn test_data_page_stats_with_all_null_page() {
+    for data_type in &[
+        DataType::Boolean,
+        DataType::UInt64,
+        DataType::UInt32,
+        DataType::UInt16,
+        DataType::UInt8,
+        DataType::Int64,
+        DataType::Int32,
+        DataType::Int16,
+        DataType::Int8,
+        DataType::Float16,
+        DataType::Float32,
+        DataType::Float64,
+        DataType::Date32,
+        DataType::Date64,
+        DataType::Time32(TimeUnit::Millisecond),
+        DataType::Time32(TimeUnit::Second),
+        DataType::Time64(TimeUnit::Microsecond),
+        DataType::Time64(TimeUnit::Nanosecond),
+        DataType::Timestamp(TimeUnit::Second, None),
+        DataType::Timestamp(TimeUnit::Millisecond, None),
+        DataType::Timestamp(TimeUnit::Microsecond, None),
+        DataType::Timestamp(TimeUnit::Nanosecond, None),
+        DataType::Binary,
+        DataType::LargeBinary,
+        DataType::FixedSizeBinary(3),
+        DataType::Utf8,
+        DataType::LargeUtf8,
+        DataType::Dictionary(Box::new(DataType::Int32), 
Box::new(DataType::Utf8)),
+        DataType::Decimal128(8, 2),  // as INT32
+        DataType::Decimal128(10, 2), // as INT64
+        DataType::Decimal128(20, 2), // as FIXED_LEN_BYTE_ARRAY
+        DataType::Decimal256(8, 2),  // as INT32
+        DataType::Decimal256(10, 2), // as INT64
+        DataType::Decimal256(20, 2), // as FIXED_LEN_BYTE_ARRAY
+    ] {
+        let batch =
+            RecordBatch::try_from_iter(vec![("col", new_null_array(data_type, 
4))])
+                .expect("record batch creation");
+
+        let reader =
+            build_parquet_file(4, Some(EnabledStatistics::Page), Some(4), 
vec![batch]);
+
+        let expected_data_type = match data_type {
+            DataType::Dictionary(_, value_type) => value_type.as_ref(),
+            _ => data_type,
+        };
+
+        // There is one data page with 4 nulls
+        // The statistics should be present but null
+        Test {

Review Comment:
   I verified that this test covered the code changes by running the test 
without the code changes and it failed as expected. 
   
   ```text
   thread 'parquet::arrow_statistics::test_data_page_stats_with_all_null_page' 
panicked at datafusion/core/tests/parquet/arrow_statistics.rs:276:13:
   assertion `left == right` failed: col: Mismatch with expected data page 
minimums
     left: PrimitiveArray<UInt64>
   [
   ]
    right: PrimitiveArray<UInt64>
   [
     null,
   ]
   stack backtrace:
      0: rust_begin_unwind
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
      1: core::panicking::panic_fmt
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
      2: core::panicking::assert_failed_inner
      3: core::panicking::assert_failed
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364:5
      4: parquet_exec::parquet::arrow_statistics::Test::run_checks
                at ./tests/parquet/arrow_statistics.rs:276:13
      5: parquet_exec::parquet::arrow_statistics::Test::run
                at ./tests/parquet/arrow_statistics.rs:229:9
      6: 
parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
                at ./tests/parquet/arrow_statistics.rs:567:9
      7: <core::pin::Pin<P> as core::future::future::Future>::poll
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
      8: <core::pin::Pin<P> as core::future::future::Future>::poll
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
      9: 
tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:57
     10: tokio::runtime::coop::with_budget
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
     11: tokio::runtime::coop::budget
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
     12: 
tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:25
     13: tokio::runtime::scheduler::current_thread::Context::enter
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:404:19
     14: 
tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:658:36
     15: 
tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:68
     16: tokio::runtime::context::scoped::Scoped<T>::set
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/scoped.rs:40:9
     17: tokio::runtime::context::set_scheduler::{{closure}}
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:26
     18: std::thread::local::LocalKey<T>::try_with
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:286:12
     19: std::thread::local::LocalKey<T>::with
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:262:9
     20: tokio::runtime::context::set_scheduler
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:9
     21: tokio::runtime::scheduler::current_thread::CoreGuard::enter
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:27
     22: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:646:19
     23: 
tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:175:28
     24: tokio::runtime::context::runtime::enter_runtime
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
     25: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:167:9
     26: tokio::runtime::runtime::Runtime::block_on
                at 
/Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/runtime.rs:347:47
     27: 
parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page
                at ./tests/parquet/arrow_statistics.rs:517:5
     28: 
parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
                at ./tests/parquet/arrow_statistics.rs:516:51
     29: core::ops::function::FnOnce::call_once
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
     30: core::ops::function::FnOnce::call_once
                at 
/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
   note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose 
backtrace.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to