jonded94 opened a new issue, #9370:
URL: https://github.com/apache/arrow-rs/issues/9370
**Describe the bug**
I'm trying to read a Parquet file containing a large number of image bytes
in a column with `MapType`. Unfortunately, this fails with the error "Not all
children array length are the same!", but *only* if I use a `RowSelection`! If
I omit the `RowSelection` and let the file be read normally, my
reproduction test succeeds.
**To Reproduce**
```
mod tests {
    use parquet::arrow::arrow_reader::{ArrowReaderBuilder, RowSelection, RowSelector};
    use std::fs::File;
    use std::path::PathBuf;

    #[test]
    fn validate_issue() {
        // Build a RowSelection that picks exactly the given (sorted) row indices.
        pub fn row_selection_from_indices(indices: &[usize]) -> RowSelection {
            let mut selectors = Vec::new();
            let mut last_end = 0;
            for &idx in indices {
                if idx > last_end {
                    selectors.push(RowSelector::skip(idx - last_end));
                }
                selectors.push(RowSelector::select(1));
                last_end = idx + 1;
            }
            selectors.into()
        }

        let indices = vec![352, 955];
        let arrow_reader =
            ArrowReaderBuilder::try_new(File::open(PathBuf::from("issue_file.parquet")).unwrap())
                .unwrap();
        let mut batch_reader_builder = arrow_reader;
        batch_reader_builder = batch_reader_builder.with_row_groups(vec![99]);
        // Removing this `with_row_selection` call lets the test succeed again!
        batch_reader_builder = batch_reader_builder
            .with_row_selection(row_selection_from_indices(indices.as_slice()));
        let batch_reader = batch_reader_builder.build().unwrap();
        for item in batch_reader {
            item.unwrap();
        }
    }
}
```
=> (the debug statements were added by me)
```
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9]
children_array_len = 3
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9]
children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
3,
3,
3,
3,
3,
3,
3,
]
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9]
children_array_len = 3
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9]
children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
3,
3,
]
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:120:9]
children_array_len = 2
[.../arrow-rs/parquet/src/arrow/array_reader/struct_array.rs:121:9]
children_array.iter().map(|arr| arr.len()).collect::<Vec<_>>() = [
2, // <-- Only the first array seems to be of length 2, all others have length 3
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
]
called `Result::unwrap()` on an `Err` value: ParquetError("Parquet error: Not all children array length are the same!")
```
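To make the failing condition concrete, here is a toy model of the length check that the third debug dump trips. This is only a sketch and not the arrow-rs implementation: `common_child_len` is a hypothetical name, and the assumption (suggested by the debug output, where `children_array_len` matches the first entry) is that every child is compared against the first child's length.

```rust
/// Toy stand-in for the reader's consistency check (hypothetical name, not
/// the arrow-rs implementation): returns the length shared by all child
/// arrays, or an error describing the first mismatch.
fn common_child_len(child_lens: &[usize]) -> Result<usize, String> {
    // Assumption: the first child's length is the reference value.
    let expected = match child_lens.first() {
        Some(&len) => len,
        None => return Ok(0), // no children: nothing to compare
    };
    for (i, &len) in child_lens.iter().enumerate() {
        if len != expected {
            return Err(format!(
                "child {} has length {}, expected {}",
                i, len, expected
            ));
        }
    }
    Ok(expected)
}

fn main() {
    // Lengths modelled on the three debug dumps (the last one shortened).
    let dumps: [&[usize]; 3] = [&[3, 3, 3, 3, 3, 3, 3], &[3, 3], &[2, 3, 3, 3]];
    for lens in dumps.iter() {
        println!("{:?} -> {:?}", lens, common_child_len(lens));
    }
}
```

In this model the first two dumps pass and the third fails because only the first child has length 2.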
**Expected behavior**
Iterating over the Parquet file should work without problems, regardless
of whether a `RowSelection` is used or not.
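For reference, the `RowSelection` built in the reproduction boils down to a short list of skip/select runs. The sketch below reproduces that arithmetic with the standard library only; `Run`, `runs_from_indices` and `selected_rows` are hypothetical stand-ins for illustration (not parquet types), and the indices are assumed to be sorted and unique.

```rust
/// Simplified stand-in for a skip/select run (hypothetical type, not
/// parquet's `RowSelector`).
#[derive(Debug, PartialEq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Same logic as the helper in the reproduction: skip up to each index,
/// then select exactly one row. Assumes sorted, unique indices.
fn runs_from_indices(indices: &[usize]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut last_end = 0;
    for &idx in indices {
        if idx > last_end {
            runs.push(Run::Skip(idx - last_end));
        }
        runs.push(Run::Select(1));
        last_end = idx + 1;
    }
    runs
}

/// Total number of rows the runs select.
fn selected_rows(runs: &[Run]) -> usize {
    runs.iter()
        .map(|run| match run {
            Run::Select(n) => *n,
            Run::Skip(_) => 0,
        })
        .sum()
}

fn main() {
    let runs = runs_from_indices(&[352, 955]);
    println!("{:?}", runs); // [Skip(352), Select(1), Skip(602), Select(1)]
    println!("rows selected: {}", selected_rows(&runs)); // rows selected: 2
}
```

Under these assumptions the selection for indices 352 and 955 asks for two rows in total from the chosen row group.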
**Additional context**
Happens with `arrow-rs` 57.1.0 and 57.2.0; for this specific report I used
commit fb775011.
I unfortunately can't give you the reproduction file, as it contains tons of
confidential stuff, but I shared as much `parquet-viewer` output as possible.
Most probably this is about the `image_data` map, specifically the
`image_bytes` values column?
<img width="2522" height="437" alt="Image"
src="https://github.com/user-attachments/assets/ec67ea13-1ead-4430-af64-041773c38ecc"
/>
<img width="2515" height="933" alt="Image"
src="https://github.com/user-attachments/assets/7f3b0da2-278e-48db-8175-e301a6ae0ddc"
/>
<img width="2515" height="870" alt="Image"
src="https://github.com/user-attachments/assets/0139d818-c5f3-4a81-80d2-f4cb961e4f68"
/>