jerryway42 opened a new issue, #10129:
URL: https://github.com/apache/arrow-rs/issues/10129

   ## Describe the bug
   
     `parquet` fails to read metadata for Parquet files with more than `32768` row groups (that is, any file containing a row group index above `i16::MAX`, which is `32767`) when `RowGroup.ordinal` is absent.
   
     The failure happens before decoding any data pages, inside 
`ParquetRecordBatchReaderBuilder::try_new(...)`.
   
     Error:
   
     ```text
     Parquet error: Row group ordinal 32768 exceeds i16 max value
     ```
   
     The same files are readable by PyArrow. Since `RowGroup.ordinal` is optional Parquet metadata, the reader should not fail solely because it cannot synthesize an optional `i16` ordinal for row groups beyond `i16::MAX`.
   
     Current `main` still appears to contain this logic in 
`parquet/src/file/metadata/thrift/mod.rs`:
   
     ```rust
     for ordinal in 0..list_ident.size {
         let ordinal: i16 = ordinal.try_into().map_err(|_| {
             ParquetError::General(format!(
                 "Row group ordinal {ordinal} exceeds i16 max value",
             ))
         })?;
         let rg = read_row_group(&mut prot, schema_descr, options)?;
         rg_vec.push(assigner.ensure(ordinal, rg)?);
     }
     ```
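   
     To make the boundary concrete, here is a minimal standalone sketch of the same pattern. It is not the crate's code: `synthesize_ordinals` is a hypothetical helper, and treating the row group count as an `i32` is an assumption made for illustration. It only shows that the up-front conversion succeeds for indices `0..=32767` and fails at index `32768`, regardless of what the row group itself contains.
   
     ```rust
     /// Hypothetical helper mirroring the loop above: convert every row group
     /// index to `i16` up front and stop at the first index that does not fit.
     fn synthesize_ordinals(num_row_groups: i32) -> Result<Vec<i16>, String> {
         (0..num_row_groups)
             .map(|ordinal| {
                 i16::try_from(ordinal)
                     .map_err(|_| format!("Row group ordinal {ordinal} exceeds i16 max value"))
             })
             .collect()
     }
     ```
   
     With this helper, `synthesize_ordinals(32768)` succeeds because the largest index is `32767`, while `synthesize_ordinals(32769)` returns the error shown above.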
   
     ## To Reproduce
   
     Using:
   
     ```toml
     arrow = "57"
     parquet = "57"
     ```
   
     Code:
   
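     ```rust
     use std::fs::File;
     use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   
     fn main() -> Result<(), Box<dyn std::error::Error>> {
         let file = File::open("transactions.parquet")?;
         // Fails here, while the footer metadata is parsed and before any data page is decoded.
         let _builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
         Ok(())
     }
     ```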
   
     This fails with:
   
     ```text
     Row group ordinal 32768 exceeds i16 max value
     ```
   
     PyArrow can read the same file metadata:
   
     ```python
     import pyarrow.parquet as pq
   
     pf = pq.ParquetFile("transactions.parquet")
     print(pf.metadata.num_rows)
     print(pf.metadata.num_row_groups)
     ```
   
     Observed examples:
   
     ```text
     num_rows=26,303,646  num_row_groups=37,638
     num_rows=31,308,590  num_row_groups=43,990
     num_rows=27,487,443  num_row_groups=39,870
     num_rows=25,255,685  num_row_groups=34,599
     ```
   
     The row groups are unusually small, but the files are readable by PyArrow.
   
     ## Expected behavior
   
     `ParquetRecordBatchReaderBuilder::try_new(...)` should successfully read metadata for Parquet files whose row group indices exceed `i16::MAX` (more than `32768` row groups).
   
     If `RowGroup.ordinal` is absent, the reader should avoid failing just because a synthetic ordinal cannot fit in `i16`. It could leave `RowGroupMetaData::ordinal` as `None` for such row groups, or avoid assigning synthetic ordinals on the read path unless semantically required.
   
     ## Additional context
   
     This seems to be caused by converting the row group index to `i16` before reading each row group. Since `RowGroup.ordinal` is optional metadata, this check prevents reading otherwise valid files with a large number of row groups.
   
     A possible fix would be to read the row group first, then only 
assign/check ordinal when it is present or representable.
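   
     A rough sketch of that idea, under stated assumptions: `RowGroupSketch` and `assign_ordinal` are hypothetical names, not types or functions from the `parquet` crate, and the sketch ignores whatever additional validation the existing `assigner.ensure(...)` call performs. It only illustrates leaving the ordinal as `None` when the index is not representable.
   
     ```rust
     /// Hypothetical, simplified stand-in for `RowGroupMetaData`;
     /// only the field relevant to this issue is modelled.
     #[derive(Debug, Default, PartialEq)]
     struct RowGroupSketch {
         ordinal: Option<i16>,
     }
   
     /// Intended to run after the row group at `index` has already been read.
     /// An ordinal present in the file is kept as-is; a missing one is filled
     /// in only when the index fits in `i16`, otherwise it stays `None`.
     fn assign_ordinal(index: usize, mut rg: RowGroupSketch) -> RowGroupSketch {
         if rg.ordinal.is_none() {
             // `ok()` turns an out-of-range index into `None` instead of an error.
             rg.ordinal = i16::try_from(index).ok();
         }
         rg
     }
     ```
   
     In this sketch, a row group at index `32768` with no stored ordinal simply keeps `ordinal == None`, so reading the metadata no longer fails.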
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
