helgikrs opened a new issue #1184:
URL: https://github.com/apache/arrow-rs/issues/1184
**Describe the bug**
Writing an Arrow record batch that contains structs nested within lists
with the parquet writer produces a parquet file with incorrect values when
null or empty lists are present.
**To Reproduce**
The following program produces a parquet file `out.parquet`.
```rust
use std::sync::Arc;

use arrow::array::{Int32Builder, ListBuilder, StructBuilder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    // define schema
    let int_field = Field::new("a", DataType::Int32, true);
    let item_field = Field::new("item", DataType::Struct(vec![int_field.clone()]), true);
    let list_field = Field::new("list", DataType::List(Box::new(item_field)), true);

    let int_builder = Int32Builder::new(10);
    let struct_builder = StructBuilder::new(vec![int_field], vec![Box::new(int_builder)]);
    let mut list_builder = ListBuilder::new(struct_builder);

    // [{a: 1}], [], null, [null, null], [{a: null}], [{a: 2}]
    //
    // [{a: 1}]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_value(1)
        .unwrap();
    values.append(true).unwrap();
    list_builder.append(true).unwrap();

    // []
    list_builder.append(true).unwrap();

    // null
    list_builder.append(false).unwrap();

    // [null, null]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_null()
        .unwrap();
    values.append(false).unwrap();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_null()
        .unwrap();
    values.append(false).unwrap();
    list_builder.append(true).unwrap();

    // [{a: null}]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_null()
        .unwrap();
    values.append(true).unwrap();
    list_builder.append(true).unwrap();

    // [{a: 2}]
    let values = list_builder.values();
    values
        .field_builder::<Int32Builder>(0)
        .unwrap()
        .append_value(2)
        .unwrap();
    values.append(true).unwrap();
    list_builder.append(true).unwrap();

    let array = Arc::new(list_builder.finish());
    let schema = Arc::new(Schema::new(vec![list_field]));
    let rb = RecordBatch::try_new(schema, vec![array]).unwrap();

    let out = std::fs::File::create("out.parquet").unwrap();
    let mut writer = parquet::arrow::ArrowWriter::try_new(out, rb.schema(), None).unwrap();
    writer.write(&rb).unwrap();
    writer.close().unwrap();
}
```
Running `parquet-dump` on `out.parquet` produces the following output:
```
value 1: R:0 D:4 V:1
value 2: R:0 D:1 V:<null>
value 3: R:0 D:0 V:<null>
value 4: R:0 D:2 V:<null>
value 5: R:1 D:2 V:<null>
value 6: R:0 D:3 V:<null>
value 7: R:0 D:4 V:0
```
**Expected behavior**
The last value (value 7) should be 2, not 0:
```
value 1: R:0 D:4 V:1
value 2: R:0 D:1 V:<null>
value 3: R:0 D:0 V:<null>
value 4: R:0 D:2 V:<null>
value 5: R:1 D:2 V:<null>
value 6: R:0 D:3 V:<null>
value 7: R:0 D:4 V:2
```
**Additional context**
The `filter_array_indices` function in
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L760
produces incorrect indices when the immediate parent of a field is not a list.
The writer, at
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer.rs#L244,
then uses those indices to select the values to write at
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer.rs#L284,
which causes the incorrect behavior described above.
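
To illustrate which indices would be expected for the example above, here is a minimal standalone sketch. It is not the arrow-rs implementation: `leaf_value_indices` and its parameters are hypothetical names, and it assumes the level meaning implied by the dump (0 = null list, 1 = empty list, 2 = null struct, 3 = null `a`, 4 = value present). Under that assumption, only levels of 2 or higher occupy a slot in the Int32 child array, so the index into that array must not advance for null or empty lists.

```rust
/// Hypothetical helper (not part of arrow-rs): maps definition levels to the
/// positions of the non-null values in the leaf array.
///
/// `first_slot_level` is the lowest definition level that still occupies a
/// slot in the leaf array; `max_def_level` marks a present value.
fn leaf_value_indices(
    def_levels: &[i16],
    first_slot_level: i16,
    max_def_level: i16,
) -> Vec<usize> {
    let mut indices = Vec::new();
    let mut slot = 0usize; // current position in the leaf array
    for &def in def_levels {
        if def == max_def_level {
            indices.push(slot);
        }
        // Null and empty lists have no child slot, so they do not advance it.
        if def >= first_slot_level {
            slot += 1;
        }
    }
    indices
}

fn main() {
    // Definition levels from the parquet-dump output above.
    let def_levels = [4, 1, 0, 2, 2, 3, 4];
    // Leaf array built by the reproduction: 1, three nulls, then 2.
    let leaf: [Option<i32>; 5] = [Some(1), None, None, None, Some(2)];

    let indices = leaf_value_indices(&def_levels, 2, 4);
    let values: Vec<i32> = indices.iter().filter_map(|&i| leaf[i]).collect();
    println!("{:?} {:?}", indices, values); // prints "[0, 4] [1, 2]"
}
```

With these inputs the second written value comes from leaf index 4, which holds the expected 2.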