scovich opened a new issue, #7119:
URL: https://github.com/apache/arrow-rs/issues/7119
Row visitor machinery in https://github.com/delta-io/delta-kernel-rs recently started behaving strangely, with row visitors returning `0` or `""` for values that should have been NULL. We eventually bisected it to a regression in 53.3, specifically https://github.com/apache/arrow-rs/pull/6524, which attempted to address https://github.com/apache/arrow-rs/issues/6510.

tl;dr: When accessing column `a.b`, where `a` is nullable and `b` is non-nullable, any row for which `a` is NULL incorrectly yields a non-NULL `b` (default-initialized to e.g. `0` or `""`). As I understand it, the correct way to handle a non-nullable column with nullable ancestors is to let the "non-nullable" column take null values exactly and only on rows for which some ancestor is NULL. The JSON reader does this correctly.

The minimal repro below, which compares columns before and after a round trip through parquet, fails with:

```
assertion `left == right` failed
  left: PrimitiveArray<Int32>
[
  null,
]
 right: PrimitiveArray<Int32>
[
  0,
]
```

<details>

```rust
#[test]
fn test_arrow_bug() {
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow_array::cast::AsArray as _;
    use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
    use std::sync::Arc;

    // Parse some JSON into a nested schema
    let schema = Arc::new(Schema::new(vec![Field::new(
        "outer",
        DataType::Struct(vec![Field::new("inner", DataType::Int32, false)].into()),
        true,
    )]));
    let json_string = r#"{"outer": null}"#;
    let batch1 = arrow::json::ReaderBuilder::new(schema.clone())
        .build(json_string.as_bytes())
        .unwrap()
        .next()
        .unwrap()
        .unwrap();
    println!("Batch 1: {batch1:?}");
    let col1 = batch1.column(0).as_struct().column(0);
    println!("Col1: {col1:?}");

    // Write the batch to a parquet file and read it back
    let mut buffer = vec![];
    let mut writer =
        parquet::arrow::ArrowWriter::try_new(&mut buffer, schema.clone(), None).unwrap();
    writer.write(&batch1).unwrap();
    writer.close().unwrap(); // writer must be closed to write footer

    let batch2 = ParquetRecordBatchReaderBuilder::try_new(bytes::Bytes::from(buffer))
        .unwrap()
        .build()
        .unwrap()
        .next()
        .unwrap()
        .unwrap();
    println!("Batch 2: {batch2:?}");
    let col2 = batch2.column(0).as_struct().column(0);
    println!("Col2: {col2:?}");

    // Verify accuracy of the round trip
    assert_eq!(batch1, batch2);
    assert_eq!(col1, col2);
}
```

</details>
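For reference, here is a minimal sketch of the expected semantics described above, written by hand rather than taken from the reader code: a child declared non-nullable should still read as NULL on every row where its parent struct is NULL. The names `parent_is_null` and `logical_inner`, and the use of `arrow::compute::nullif` to propagate parent nulls, are purely illustrative of what the repro expects, not how the parquet reader is (or should be) implemented internally.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BooleanArray, Int32Array, StructArray};
use arrow::buffer::NullBuffer;
use arrow::datatypes::{DataType, Field};

fn main() {
    // An `outer` struct with a single non-nullable Int32 child `inner`,
    // where row 0 of `outer` itself is NULL (mirrors the repro above).
    let inner: ArrayRef = Arc::new(Int32Array::from(vec![0])); // physical placeholder value
    let outer = StructArray::new(
        vec![Field::new("inner", DataType::Int32, false)].into(),
        vec![inner],
        Some(NullBuffer::from(vec![false])), // `outer` is NULL on row 0
    );

    // Expected semantics: wherever `outer` is NULL, `outer.inner` should also read
    // as NULL, even though the field itself is declared non-nullable.
    let parent_is_null = BooleanArray::from(
        outer
            .nulls()
            .expect("outer has a null buffer")
            .iter()
            .map(|valid| !valid)
            .collect::<Vec<bool>>(),
    );
    let logical_inner = arrow::compute::nullif(outer.column(0), &parent_is_null).unwrap();
    assert!(logical_inner.is_null(0)); // this is what the repro expects back from parquet
}
```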