helgikrs opened a new issue #1184:
URL: https://github.com/apache/arrow-rs/issues/1184


   **Describe the bug**
   Writing an Arrow record batch that contains structs nested within lists
   with the Parquet writer produces a Parquet file with incorrect values when
   null or empty lists are present.
   
   
   **To Reproduce**
   The following program produces a parquet file `out.parquet`.
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{Int32Builder, ListBuilder, StructBuilder};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   
   fn main() {
       // define schema
       let int_field = Field::new("a", DataType::Int32, true);
       let item_field = Field::new("item", DataType::Struct(vec![int_field.clone()]), true);
       let list_field = Field::new("list", DataType::List(Box::new(item_field)), true);
   
       let int_builder = Int32Builder::new(10);
       let struct_builder = StructBuilder::new(vec![int_field], vec![Box::new(int_builder)]);
       let mut list_builder = ListBuilder::new(struct_builder);
   
       // [{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]
       //
       // [{a: 1}]
       let values = list_builder.values();
       values
           .field_builder::<Int32Builder>(0)
           .unwrap()
           .append_value(1)
           .unwrap();
       values.append(true).unwrap();
       list_builder.append(true).unwrap();
   
       // []
       list_builder.append(true).unwrap();
   
       // null
       list_builder.append(false).unwrap();
   
       // [null, null]
       let values = list_builder.values();
       values
           .field_builder::<Int32Builder>(0)
           .unwrap()
           .append_null()
           .unwrap();
       values.append(false).unwrap();
       values
           .field_builder::<Int32Builder>(0)
           .unwrap()
           .append_null()
           .unwrap();
       values.append(false).unwrap();
       list_builder.append(true).unwrap();
   
       // [{a: null}]
       let values = list_builder.values();
       values
           .field_builder::<Int32Builder>(0)
           .unwrap()
           .append_null()
           .unwrap();
       values.append(true).unwrap();
       list_builder.append(true).unwrap();
   
       // [{a: 2}]
       let values = list_builder.values();
       values
           .field_builder::<Int32Builder>(0)
           .unwrap()
           .append_value(2)
           .unwrap();
       values.append(true).unwrap();
       list_builder.append(true).unwrap();
   
       let array = Arc::new(list_builder.finish());
   
       let schema = Arc::new(Schema::new(vec![list_field]));
   
       let rb = RecordBatch::try_new(schema, vec![array]).unwrap();
   
       let out = std::fs::File::create("out.parquet").unwrap();
       let mut writer = parquet::arrow::ArrowWriter::try_new(out, rb.schema(), None).unwrap();
       writer.write(&rb).unwrap();
       writer.close().unwrap();
   }
   ```
   
   Running `parquet-dump` on `out.parquet` produces the following output:
   
   ```
   value 1: R:0 D:4 V:1
   value 2: R:0 D:1 V:<null>
   value 3: R:0 D:0 V:<null>
   value 4: R:0 D:2 V:<null>
   value 5: R:1 D:2 V:<null>
   value 6: R:0 D:3 V:<null>
   value 7: R:0 D:4 V:0
   ```
   
   **Expected behavior**
   The last value (value 7) should have been 2:
   ```
   value 1: R:0 D:4 V:1
   value 2: R:0 D:1 V:<null>
   value 3: R:0 D:0 V:<null>
   value 4: R:0 D:2 V:<null>
   value 5: R:1 D:2 V:<null>
   value 6: R:0 D:3 V:<null>
   value 7: R:0 D:4 V:2
   ```
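
The expected repetition and definition levels can be derived by hand from the six rows in the reproduction. The following is a minimal sketch of that derivation, not code from arrow-rs: `Item`, `Row`, and `expected_levels` are hypothetical names introduced here, and the level numbering assumes that the list, the struct, and the field `a` are all nullable, as in the schema above.

```rust
/// Hypothetical stand-in for the `{a: Int32}` struct in the reproduction.
struct Item {
    a: Option<i32>,
}

/// One row of the `list` column: a nullable list of nullable structs.
type Row = Option<Vec<Option<Item>>>;

/// Returns (repetition level, definition level, value) for every entry
/// expected in the leaf column `list.item.a`.
fn expected_levels(rows: &[Row]) -> Vec<(i16, i16, Option<i32>)> {
    let mut out = Vec::new();
    for row in rows {
        match row {
            // Null list: nothing is defined.
            None => out.push((0, 0, None)),
            // Empty list: only the list itself is defined.
            Some(items) if items.is_empty() => out.push((0, 1, None)),
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // The first element starts a new row, the rest repeat at level 1.
                    let rep = if i == 0 { 0 } else { 1 };
                    let (def, value) = match item {
                        None => (2, None),                          // null struct
                        Some(Item { a: None }) => (3, None),        // struct present, `a` null
                        Some(Item { a: Some(v) }) => (4, Some(*v)), // fully defined
                    };
                    out.push((rep, def, value));
                }
            }
        }
    }
    out
}
```

Applied to the rows `[{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]`, this yields the seven `(R, D, V)` entries listed in the expected output, ending with `(0, 4, Some(2))`.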
   
   **Additional context**
   The `filter_array_indices` function in
   https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L760
   produces incorrect indices when the immediate parent of a field is not a list.
   In the writer
   (https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer.rs#L244),
   those indices are then used to produce the values to write at
   https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer.rs#L284,
   causing the incorrect behavior described above.
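
To illustrate which indices the writer needs for this column, here is a minimal sketch of the mapping from definition levels to positions in the leaf `Int32` array. It is not the arrow-rs implementation: `leaf_value_indices` and its parameters are hypothetical, and it assumes that null and empty lists occupy no slot in the child array, which is how the builders in the reproduction lay out the data.

```rust
/// Hypothetical helper (not part of arrow-rs): returns the indices into the
/// leaf array that hold non-null values, given the column's definition levels.
///
/// `element_level` is the lowest definition level at which a list element
/// exists (2 in this issue); `max_level` marks a fully defined value (4 here).
fn leaf_value_indices(def_levels: &[i16], element_level: i16, max_level: i16) -> Vec<usize> {
    let mut indices = Vec::new();
    let mut slot = 0usize; // current position in the leaf array
    for &level in def_levels {
        if level < element_level {
            // Null or empty list: a level is recorded, but no leaf slot exists.
            continue;
        }
        if level == max_level {
            // Fully defined: this slot carries a value to write.
            indices.push(slot);
        }
        // Null structs and null leaf values still consume a slot.
        slot += 1;
    }
    indices
}
```

For the definition levels `[4, 1, 0, 2, 2, 3, 4]` from the dump, `leaf_value_indices(&levels, 2, 4)` returns `[0, 4]`. The leaf array built by the reproduction is `[1, null, null, null, 2]`, so those indices select the values 1 and 2.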

