scovich opened a new issue, #7119:
URL: https://github.com/apache/arrow-rs/issues/7119

   Row visitor machinery in https://github.com/delta-io/delta-kernel-rs 
recently started behaving strangely, with row visitors returning `0` or `""` 
for values that should have been NULL. We eventually bisected the regression 
to 53.3, specifically https://github.com/apache/arrow-rs/pull/6524, which 
attempted to address https://github.com/apache/arrow-rs/issues/6510. 
   
   tl;dr: When accessing column `a.b`, where `a` is nullable and `b` is 
non-nullable, any row for which `a` is NULL will incorrectly yield a non-NULL 
`b` (default-initialized to e.g. `0` or `""`). 
   
   As I understand it, the correct way to handle a non-nullable column with 
nullable ancestors is to allow the "non-nullable" column to take null values 
exactly and only on rows for which some ancestor is also NULL. The JSON reader 
does this correctly.
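   
   For reference, arrow-rs's own `StructArray` constructor appears to encode this 
invariant already: `StructArray::try_new` rejects unmasked nulls in a 
non-nullable child, but accepts child nulls on rows where the parent struct is 
itself null. A minimal sketch of the expected shape, hand-built rather than 
produced by a reader (this reflects my reading of the validation logic, so 
treat it as illustrative):
   
   ```rust
   use std::sync::Arc;
   
   use arrow_array::{Array, ArrayRef, Int32Array, StructArray};
   use arrow_buffer::NullBuffer;
   use arrow_schema::{DataType, Field, Fields};
   
   fn main() {
       // One row: the outer struct is null, so the non-nullable inner
       // column carries a null that is masked by the parent's null buffer.
       let inner = Int32Array::from(vec![None::<i32>]);
       let fields = Fields::from(vec![Field::new("inner", DataType::Int32, false)]);
       let nulls = NullBuffer::new_null(1); // the single row is null
   
       // try_new validates that child nulls in a non-nullable field are
       // masked by the parent's nulls; an unmasked null would be rejected.
       let outer = StructArray::try_new(fields, vec![Arc::new(inner) as ArrayRef], Some(nulls))
           .unwrap();
   
       assert!(outer.is_null(0));
       assert!(outer.column(0).is_null(0)); // the inner value is null, not 0
   }
   ```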
   
   The minimal repro below, which compares columns before and after a round 
trip through parquet, fails with:
   ```
   assertion `left == right` failed
     left: PrimitiveArray<Int32>
   [
     null,
   ]
    right: PrimitiveArray<Int32>
   [
     0,
   ]
   ```
   
   <details>
   
   ```rust
   #[test]
   fn test_arrow_bug() {
       use arrow::datatypes::{DataType, Field, Schema};
       use arrow_array::cast::AsArray as _;
       use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
       use std::sync::Arc;
   
        // Define a nested schema and parse some JSON against it
       let schema = Arc::new(Schema::new(vec![
           Field::new(
               "outer",
               DataType::Struct(vec![
                   Field::new(
                       "inner",
                       DataType::Int32,
                       false,
                   )].into()),
               true,
           ),
       ]));
       let json_string = r#"{"outer": null}"#;
       let batch1 = arrow::json::ReaderBuilder::new(schema.clone())
           .build(json_string.as_bytes())
           .unwrap()
           .next()  
           .unwrap()
           .unwrap();
       println!("Batch 1: {batch1:?}");
   
       let col1 = batch1.column(0).as_struct().column(0);
       println!("Col1: {col1:?}");
   
       // Write the batch to a parquet file and read it back 
       let mut buffer = vec![];
        let mut writer =
            parquet::arrow::ArrowWriter::try_new(&mut buffer, schema.clone(), None).unwrap();
       writer.write(&batch1).unwrap();
       writer.close().unwrap(); // writer must be closed to write footer 
        let batch2 = ParquetRecordBatchReaderBuilder::try_new(bytes::Bytes::from(buffer))
           .unwrap()
           .build()
           .unwrap()
           .next()
           .unwrap()
           .unwrap();
       println!("Batch 2: {batch2:?}");
   
       let col2 = batch2.column(0).as_struct().column(0);
       println!("Col2: {col2:?}");
   
        // Verify accuracy of the round trip
       assert_eq!(batch1, batch2);
       assert_eq!(col1, col2);
   }
   ```
   
   </details>
   
   

