alamb opened a new pull request #8007:
URL: https://github.com/apache/arrow/pull/8007


   I was using `ParquetFileArrowReader` to read a parquet file into 
`RecordBatches`. The file had row groups of 100,000 rows each and I used a 
batch size of 60,000. After reading 300,000 rows successfully, I started 
seeing this error:
   
   ```
    ParquetError("Parquet error: Not all children array length are the same!")
   ```
   
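   Upon investigation, I found that when reading with `ParquetFileArrowReader`, 
if the parquet input file has multiple row groups and a batch happens to end 
exactly at the end of a row group for an Int or Float column, no subsequent 
row groups are read for that column. In my case, 300,000 rows is both five 
batches of 60,000 and three row groups of 100,000, so the fifth batch ended 
exactly on a row group boundary.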
   
   Visually:
   
   ```
   +-----+
   | RG1 |
   |     |
   +-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
   +-----+
   | RG2 |
   |     |
   +-----+
   ```
   
   I traced the issue down to a bug in `PrimitiveArrayReader` where it 
mistakenly interprets reading `0` rows from a page reader as being at the end 
of the column.
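
   To illustrate the failure mode, here is a minimal sketch of the idea. It 
is a toy model, not the actual `PrimitiveArrayReader` code: `ToyColumnReader`, 
`read_from_current`, `next_row_group`, `next_batch`, `read_all` and the 
`zero_means_end` flag are all hypothetical names made up for this example. 
Each row group is modelled as a `Vec<i32>`, and the flag switches between 
treating a 0-row read as the end of the column (the mistake) and moving on to 
the next row group.

```rust
/// Toy model of a column stored as several row groups.
/// Hypothetical types for illustration; not the parquet crate's API.
struct ToyColumnReader {
    row_groups: Vec<Vec<i32>>,
    group: usize,  // index of the row group currently being read
    offset: usize, // next unread position inside that row group
}

impl ToyColumnReader {
    fn new(row_groups: Vec<Vec<i32>>) -> Self {
        ToyColumnReader { row_groups, group: 0, offset: 0 }
    }

    /// Copy up to `max` values from the current row group into `out`.
    /// Returns how many were copied; 0 means this row group is exhausted.
    fn read_from_current(&mut self, max: usize, out: &mut Vec<i32>) -> usize {
        let rg = match self.row_groups.get(self.group) {
            Some(rg) => rg,
            None => return 0,
        };
        let n = max.min(rg.len() - self.offset);
        out.extend_from_slice(&rg[self.offset..self.offset + n]);
        self.offset += n;
        n
    }

    /// Move to the next row group; returns false if none remain.
    fn next_row_group(&mut self) -> bool {
        if self.group + 1 < self.row_groups.len() {
            self.group += 1;
            self.offset = 0;
            true
        } else {
            false
        }
    }

    /// Read one batch of up to `batch_size` values.
    /// With `zero_means_end == true`, a 0-row read ends the column (the mistake).
    fn next_batch(&mut self, batch_size: usize, zero_means_end: bool) -> Vec<i32> {
        let mut out = Vec::new();
        while out.len() < batch_size {
            let n = self.read_from_current(batch_size - out.len(), &mut out);
            if zero_means_end && n == 0 {
                break; // wrong: an exhausted row group is not the end of the column
            }
            if out.len() < batch_size && !self.next_row_group() {
                break; // short read and no row groups left: really the end
            }
        }
        out
    }
}

/// Drain the reader batch by batch until an empty batch comes back.
fn read_all(mut reader: ToyColumnReader, batch_size: usize, zero_means_end: bool) -> Vec<i32> {
    let mut all = Vec::new();
    loop {
        let batch = reader.next_batch(batch_size, zero_means_end);
        if batch.is_empty() {
            break;
        }
        all.extend(batch);
    }
    all
}
```

   With row groups `[1, 2, 3, 4]` and `[5, 6, 7, 8]` and a batch size of 2, 
`read_all(reader, 2, true)` stops after the first four values, while 
`read_all(reader, 2, false)` returns all eight. With a batch size of 3, no 
batch ends on the row group boundary and neither variant loses data.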


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

