jerryway42 opened a new issue, #10129:
URL: https://github.com/apache/arrow-rs/issues/10129
### Describe the bug
`parquet` fails to read metadata for Parquet files with more than `32768`
row groups (that is, any file with a row group index above `i16::MAX` = `32767`)
when `RowGroup.ordinal` is absent.
The failure happens before decoding any data pages, inside
`ParquetRecordBatchReaderBuilder::try_new(...)`.
Error:
```text
Parquet error: Row group ordinal 32768 exceeds i16 max value
```
The same files are readable by PyArrow. Since `RowGroup.ordinal` is
optional Parquet metadata, the reader should not fail solely because it cannot
synthesize an optional `i16` ordinal for row groups beyond `i16::MAX`.
Current `main` still appears to contain this logic in
`parquet/src/file/metadata/thrift/mod.rs`:
```rust
for ordinal in 0..list_ident.size {
    let ordinal: i16 = ordinal.try_into().map_err(|_| {
        ParquetError::General(format!(
            "Row group ordinal {ordinal} exceeds i16 max value",
        ))
    })?;
    let rg = read_row_group(&mut prot, schema_descr, options)?;
    rg_vec.push(assigner.ensure(ordinal, rg)?);
}
```
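The conversion that fails can be seen in isolation with only the standard library. This is a minimal sketch, not the crate's code: `synth_ordinal` is a hypothetical helper that mirrors the `try_into` call above and returns a plain `String` error instead of `ParquetError`.

```rust
/// Hypothetical helper (not part of `parquet`) mirroring the conversion above:
/// turn a zero-based row group index (an i32 list position) into an i16.
fn synth_ordinal(index: i32) -> Result<i16, String> {
    i16::try_from(index)
        .map_err(|_| format!("Row group ordinal {index} exceeds i16 max value"))
}

fn main() {
    // The last index that fits is i16::MAX, i.e. the 32768th row group.
    assert_eq!(synth_ordinal(32767), Ok(i16::MAX));
    // The next index fails, and in the loop above that happens before
    // the corresponding row group has been read at all.
    assert!(synth_ordinal(32768).is_err());
}
```

Because the check runs on the list position rather than on anything stored in the file, it seems to trigger regardless of what the row group itself contains.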
## To Reproduce
Using:
```toml
arrow = "57"
parquet = "57"
```
Code:
```rust
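use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("transactions.parquet")?;
    // Fails here, while parsing the footer metadata.
    let _reader = ParquetRecordBatchReaderBuilder::try_new(file)?;
    Ok(())
}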
```
This fails with:
```text
Row group ordinal 32768 exceeds i16 max value
```
PyArrow can read the same file metadata:
```python
import pyarrow.parquet as pq
pf = pq.ParquetFile("transactions.parquet")
print(pf.metadata.num_rows)
print(pf.metadata.num_row_groups)
```
Observed examples:
```text
num_rows=26,303,646 num_row_groups=37,638
num_rows=31,308,590 num_row_groups=43,990
num_rows=27,487,443 num_row_groups=39,870
num_rows=25,255,685 num_row_groups=34,599
```
The row groups are unusually small, but the files are readable by PyArrow.
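To put "unusually small" in numbers, the observed figures work out to roughly 700 rows per row group. The sketch below only redoes that arithmetic from the values listed above; the constant and function names are made up for illustration.

```rust
/// (num_rows, num_row_groups) pairs copied from the observed examples above.
const OBSERVED: [(u64, u64); 4] = [
    (26_303_646, 37_638),
    (31_308_590, 43_990),
    (27_487_443, 39_870),
    (25_255_685, 34_599),
];

/// Average number of rows per row group.
fn avg_rows_per_group(num_rows: u64, num_row_groups: u64) -> f64 {
    num_rows as f64 / num_row_groups as f64
}

fn main() {
    for &(rows, groups) in OBSERVED.iter() {
        // Row groups whose zero-based index does not fit in an i16.
        let beyond_i16 = groups - (i16::MAX as u64 + 1);
        println!(
            "{groups} row groups, ~{:.0} rows each, {beyond_i16} with index > i16::MAX",
            avg_rows_per_group(rows, groups)
        );
    }
}
```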
## Expected behavior
`ParquetRecordBatchReaderBuilder::try_new(...)` should successfully read
metadata for Parquet files with more than `32768` row groups, i.e. files
containing row group indices above `i16::MAX`.
If `RowGroup.ordinal` is absent, the reader should avoid failing just
because a synthetic ordinal cannot fit in `i16`. It could leave
`RowGroupMetaData::ordinal` as `None` for such row groups, or avoid
assigning synthetic ordinals on the read path unless semantically required.
## Additional context
This seems to be caused by converting the row group index to `i16` before
reading each row group. Since `RowGroup.ordinal` is optional metadata, this
check prevents reading otherwise valid files with a large number of row groups.
A possible fix would be to read the row group first, then only
assign/check ordinal when it is present or representable.
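A rough sketch of that idea, using only the standard library. The `RowGroup` struct and `assign_ordinal` function are simplified stand-ins invented for illustration; they are not the crate's types, and the real `assigner.ensure(...)` may perform additional validation that is not modelled here.

```rust
/// Simplified stand-in for the crate's row group metadata (assumption).
#[derive(Debug, PartialEq)]
struct RowGroup {
    /// Ordinal as stored in the file, if the writer set one.
    ordinal: Option<i16>,
}

/// Keep an ordinal the file already carries; otherwise synthesize one
/// from the list position only when that position fits in an i16.
fn assign_ordinal(index: usize, mut rg: RowGroup) -> RowGroup {
    if rg.ordinal.is_none() {
        // `ok()` turns an out-of-range index into `None` instead of an error.
        rg.ordinal = i16::try_from(index).ok();
    }
    rg
}

fn main() {
    // Pretend each row group was already read and had no stored ordinal.
    let groups: Vec<RowGroup> = (0..40_000)
        .map(|i| assign_ordinal(i, RowGroup { ordinal: None }))
        .collect();

    assert_eq!(groups[32_767].ordinal, Some(i16::MAX));
    assert_eq!(groups[32_768].ordinal, None);
}
```

With this shape, reading never fails on the index alone; callers that truly need an ordinal would have to handle `None` for row groups past `i16::MAX`.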