tustvold opened a new issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111


   **Describe the bug**
   
   Originally reported in 
https://github.com/apache/arrow-datafusion/issues/1441 and encountered again in 
#1110, `ParquetFileArrowReader` appears to read incorrect data for string 
columns that contain nulls.
   
   In particular, the bug requires that the column be nullable and contain 
nulls, and that the file contain multiple row groups.
   
   **To Reproduce**
   
   Read [simple_strings.parquet.zip](https://github.com/apache/arrow-rs/files/7793788/simple_strings.parquet.zip) with the following code:
   
   ```
       #[test]
       fn test_read_strings() {
           let testdata = arrow::util::test_util::parquet_test_data();
           let path = format!("{}/simple_strings.parquet", testdata);
           let parquet_file_reader =
               SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
           let mut arrow_reader =
               ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
           let record_batch_reader = arrow_reader
               .get_record_reader(60)
               .expect("Failed to read into array!");
   
           let batches = record_batch_reader
               .collect::<arrow::error::Result<Vec<_>>>()
               .unwrap();
   
           assert_eq!(batches.len(), 1);
           let batch = batches.into_iter().next().unwrap();
           assert_eq!(batch.num_rows(), 6);
   
           let strings = batch
               .column(0)
               .as_any()
               .downcast_ref::<StringArray>()
               .unwrap();
   
           let strings: Vec<_> = strings.iter().collect();
   
           assert_eq!(
               &strings,
               &[
                   None,
                   Some("-1685637712"),
                   Some("512814980"),
                   Some("868743207"),
                   None,
                   Some("-1001940778")
               ]
           )
       }
   ```
   
   Fails with
   
   ```
   thread 'arrow::arrow_reader::tests::test_read_strings' panicked at 'assertion failed: `(left == right)`
     left: `[None, Some("-1685637712"), Some("512814980"), Some("-1685637712"), None, Some("868743207")]`,
    right: `[None, Some("-1685637712"), Some("512814980"), Some("868743207"), None, Some("-1001940778")]`', parquet/src/arrow/arrow_reader.rs:715:9
   ```
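
To make the failure easier to read, the two vectors from the panic message above can be compared row by row. The following is only a minimal standalone sketch: `mismatched_rows` is a hypothetical helper written for this illustration, not part of arrow-rs or parquet, and the values are copied by hand from the assertion output.

```rust
/// Return the row indices at which two columns of optional strings disagree.
/// Hypothetical helper for illustration; if the slices differ in length,
/// only the common prefix is compared.
fn mismatched_rows(actual: &[Option<&str>], expected: &[Option<&str>]) -> Vec<usize> {
    actual
        .iter()
        .zip(expected.iter())
        .enumerate()
        .filter(|(_, (a, e))| a != e)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // `left` from the panic message: what the reader returned.
    let actual = [
        None,
        Some("-1685637712"),
        Some("512814980"),
        Some("-1685637712"),
        None,
        Some("868743207"),
    ];
    // `right` from the panic message: what the test expects.
    let expected = [
        None,
        Some("-1685637712"),
        Some("512814980"),
        Some("868743207"),
        None,
        Some("-1001940778"),
    ];
    println!("{:?}", mismatched_rows(&actual, &expected));
}
```

The null positions agree in both vectors; only the non-null values at rows 3 and 5 differ. That would be consistent with the problem appearing after a row-group boundary, although the row-group layout of the file is not shown here.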
   
   For comparison, DuckDB reads the expected values from the same file:
   
   ```
   $ python
   > import duckdb
   > duckdb.query("select * from 'simple_strings.parquet'").fetchall()
   [(None,), ('-1685637712',), ('512814980',), ('868743207',), (None,), ('-1001940778',)]
   ```
   
   **Expected behavior**
   
   The test should pass.
   

