alamb opened a new pull request #8007:
URL: https://github.com/apache/arrow/pull/8007
While reading a parquet file into `RecordBatches` using
`ParquetFileArrowReader`, with row groups of 100,000 rows each and a batch
size of 60,000, I started seeing this error after 300,000 rows had been read
successfully:
```
ParquetError("Parquet error: Not all children array length are the same!")
```
Upon investigation, I found that when reading with `ParquetFileArrowReader`,
if the parquet input file has multiple row groups and a batch happens to end
exactly at the end of a row group, no subsequent row groups are read for Int
or Float columns.
Visually:
```
+-----+
| RG1 |
|     |
+-----+  <-- If a batch ends exactly at the end of this row group (page),
             RG2 is never read
+-----+
| RG2 |
|     |
+-----+
```
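To see why the failure only shows up after 300,000 rows, here is a small arithmetic sketch. It assumes every row group and every batch has a fixed size; the function names are made up for illustration and are not part of the parquet crate. The first point where a batch boundary lands exactly on a row group boundary is the least common multiple of the two sizes.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 {
        a
    } else {
        gcd(b, a % b)
    }
}

/// First row count at which a batch boundary coincides with a row group
/// boundary, i.e. the least common multiple of the two sizes.
/// Hypothetical helper; assumes both sizes are fixed and non-zero.
fn first_coinciding_boundary(row_group_rows: u64, batch_size: u64) -> u64 {
    row_group_rows / gcd(row_group_rows, batch_size) * batch_size
}

fn main() {
    // Sizes taken from the report above.
    let boundary = first_coinciding_boundary(100_000, 60_000);
    println!("first shared boundary at row {}", boundary);
}
```

With the sizes from the report this gives 300,000: the fifth batch ends exactly where the third row group ends, which matches the point at which the error first appeared.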
I traced the issue down to a bug in `PrimitiveArrayReader` where it
mistakenly interprets reading `0` rows from a page reader as being at the end
of the column.
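The failure mode can be reproduced with a toy model. The sketch below is not the code of `PrimitiveArrayReader`; `RowGroupPages`, `ColumnReader`, `next_batch`, and `count_rows` are hypothetical stand-ins, and the choice of four row groups is an assumption made for the example. The `zero_means_end` flag switches between treating a read of `0` rows as end of column (the buggy interpretation) and moving on to the next row group instead.

```rust
/// Hypothetical stand-in for one row group's page reader: it only tracks
/// how many rows are left in that row group.
struct RowGroupPages {
    remaining: usize,
}

impl RowGroupPages {
    /// Reads up to `max` rows and returns how many were actually read.
    fn read(&mut self, max: usize) -> usize {
        let n = max.min(self.remaining);
        self.remaining -= n;
        n
    }
}

/// Hypothetical column reader spanning several row groups.
struct ColumnReader {
    row_groups: Vec<RowGroupPages>,
    current: usize,
}

impl ColumnReader {
    fn new(row_group_sizes: &[usize]) -> Self {
        let row_groups = row_group_sizes
            .iter()
            .map(|&remaining| RowGroupPages { remaining })
            .collect();
        ColumnReader { row_groups, current: 0 }
    }

    /// Returns the number of rows read for one batch.
    /// With `zero_means_end == true`, reading 0 rows from the current row
    /// group is (wrongly) taken to mean the whole column is exhausted.
    fn next_batch(&mut self, batch_size: usize, zero_means_end: bool) -> usize {
        let mut read = 0;
        while read < batch_size {
            let n = match self.row_groups.get_mut(self.current) {
                Some(rg) => rg.read(batch_size - read),
                None => break, // no row groups at all
            };
            if n == 0 && zero_means_end {
                break; // buggy interpretation: later row groups are skipped
            }
            read += n;
            if read < batch_size {
                // Current row group is exhausted: advance if one remains.
                if self.current + 1 < self.row_groups.len() {
                    self.current += 1;
                } else {
                    break;
                }
            }
        }
        read
    }
}

/// Reads batches until an empty batch comes back; returns total rows read.
fn count_rows(row_group_sizes: &[usize], batch_size: usize, zero_means_end: bool) -> usize {
    let mut reader = ColumnReader::new(row_group_sizes);
    let mut total = 0;
    loop {
        let n = reader.next_batch(batch_size, zero_means_end);
        if n == 0 {
            break;
        }
        total += n;
    }
    total
}

fn main() {
    let sizes = [100_000usize; 4]; // four row groups: an assumption
    println!("zero rows treated as end: {}", count_rows(&sizes, 60_000, true));
    println!("advance to next row group: {}", count_rows(&sizes, 60_000, false));
}
```

In this model the buggy variant stops after 300,000 rows and never touches the fourth row group, while the other variant reads all 400,000 rows. The only difference is what happens when a read returns `0` because the previous batch drained the current row group exactly.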