mmaitre314 opened a new issue, #5064:
URL: https://github.com/apache/arrow-rs/issues/5064
**Describe the bug**
I am trying to read a Parquet file generated by a Hadoop MapReduce job. The
schema is a bit complex, with a minimal repro looking something like this:
```
message schema {
  REPEATED group level1 {
    REPEATED group level2 {
      REQUIRED group level3 {
        REQUIRED INT64 value3;
      }
    }
    REQUIRED INT64 value1;
  }
}
```
Reading this schema fails with `Path ColumnPath { parts: ["value1"] } not found`. The error message is correct as far as it goes: a column named `value1` does not exist at the root. The path is off by one level and should be `level1.value1`. Looking into the code, there appears to be a double `path.pop()` happening in `reader_tree()`.
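For reference, here is how the leaf columns and their max definition/repetition levels can be listed from the parsed schema. This is a minimal standalone sketch using the parquet crate's `SchemaDescriptor`, not part of the repro, just context for the paths and levels the test below relies on:
```rust
use std::sync::Arc;

use parquet::schema::parser::parse_message_type;
use parquet::schema::types::SchemaDescriptor;

fn main() {
    // Parse the minimal repro schema from above.
    let schema = Arc::new(parse_message_type("
        message schema {
          REPEATED group level1 {
            REPEATED group level2 {
              REQUIRED group level3 {
                REQUIRED INT64 value3;
              }
            }
            REQUIRED INT64 value1;
          }
        }").unwrap());

    // Expected output:
    //   level1.level2.level3.value3 (def=2, rep=2)
    //   level1.value1 (def=1, rep=1)
    // i.e. the second leaf is level1.value1, not the bare value1
    // that the reader reports as missing.
    let descr = SchemaDescriptor::new(schema);
    for i in 0..descr.num_columns() {
        let col = descr.column(i);
        println!(
            "{} (def={}, rep={})",
            col.path().string(),
            col.max_def_level(),
            col.max_rep_level()
        );
    }
}
```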
**To Reproduce**
Minimal unit test reproducing the issue:
- Cargo.toml
```toml
[package]
name = "parquet_complex_type"
version = "0.1.0"
edition = "2021"

[dependencies]
bytes = "1.5.0"
parquet = "48.0.0"
```
- src/lib.rs
```rust
#[cfg(test)]
mod tests {
    use std::sync::Arc;

    use bytes::Bytes;
    use parquet::data_type::Int64Type;
    use parquet::file::reader::{FileReader, SerializedFileReader};
    use parquet::file::writer::SerializedFileWriter;
    use parquet::schema::parser::parse_message_type;

    #[test]
    fn test_read_write_parquet2() {
        // Create schema
        let schema = Arc::new(parse_message_type("
            message schema {
              REPEATED group level1 {
                REPEATED group level2 {
                  REQUIRED group level3 {
                    REQUIRED INT64 value3;
                  }
                }
                REQUIRED INT64 value1;
              }
            }").unwrap());

        // Write Parquet file to buffer
        let mut buffer: Vec<u8> = Vec::new();
        let mut file_writer =
            SerializedFileWriter::new(&mut buffer, schema, Default::default()).unwrap();
        let mut row_group_writer = file_writer.next_row_group().unwrap();

        // Write column level1.level2.level3.value3
        let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
        column_writer
            .typed::<Int64Type>()
            .write_batch(&[30, 31, 32], Some(&[2, 2, 2]), Some(&[0, 0, 0]))
            .unwrap();
        column_writer.close().unwrap();

        // Write column level1.value1
        let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
        column_writer
            .typed::<Int64Type>()
            .write_batch(&[10, 11, 12], Some(&[1, 1, 1]), Some(&[0, 0, 0]))
            .unwrap();
        column_writer.close().unwrap();

        // Finalize Parquet file
        row_group_writer.close().unwrap();
        file_writer.close().unwrap();
        assert_eq!(&buffer[0..4], b"PAR1");

        // Read Parquet file from buffer
        let reader = SerializedFileReader::new(Bytes::from(buffer)).unwrap();
        assert_eq!(3, reader.metadata().file_metadata().num_rows());
        let row_group_reader = reader.get_row_group(0).unwrap();
        let mut rows = row_group_reader.get_row_iter(None).unwrap();
        assert_eq!(
            rows.next().unwrap().unwrap().to_string(),
            "{level1: [{level2: [{level3: {value3: 30}}], value1: 10}]}"
        );
        assert_eq!(
            rows.next().unwrap().unwrap().to_string(),
            "{level1: [{level2: [{level3: {value3: 31}}], value1: 11}]}"
        );
        assert_eq!(
            rows.next().unwrap().unwrap().to_string(),
            "{level1: [{level2: [{level3: {value3: 32}}], value1: 12}]}"
        );
    }
}
```
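For what it's worth: on parquet 48.0.0 this test panics at the `get_row_iter(None)` call with the `Path ColumnPath { parts: ["value1"] } not found` error quoted above, and it passes with the 2-line change described below.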
**Expected behavior**
Reading the Parquet file succeeds.
**Additional context**
A fix may be to push the just-popped value back so that it can be popped again right afterwards. The 2-line change to `reader_tree()` below got the test to pass. I am very new to this codebase, but happy to send it as a pull request along with the test if the fix makes sense.
```rust
_ if repetition == Repetition::REPEATED => {
    let required_field = Type::group_type_builder(field.name())
        .with_repetition(Repetition::REQUIRED)
        .with_converted_type(field.get_basic_info().converted_type())
        .with_fields(field.get_fields().to_vec())
        .build()?;

    let value = path.pop().unwrap(); // <-- keep the path value here

    let reader = self.reader_tree(
        Arc::new(required_field),
        path,
        curr_def_level,
        curr_rep_level,
        paths,
        row_group_reader,
    )?;

    path.push(value); // <-- push the path value back here

    Reader::RepeatedReader(
        field,
        curr_def_level - 1,
        curr_rep_level - 1,
        Box::new(reader),
    )
}
```
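If I read the code correctly, the point is just to leave `path` as the recursive call found it: the same segment gets popped again right after this arm returns, so popping it here without pushing it back consumes two segments per REPEATED level instead of one, which is what turns `level1.value1` into the bare `value1` from the error above.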