tustvold commented on PR #2027: URL: https://github.com/apache/arrow-rs/pull/2027#issuecomment-1179247516
So the batch_size bug was a genuine bug, but it would have been masked by the `MIN_BATCH_SIZE` setting of 1024. The actual cause of the failure was more subtle.

The bug was that, once exhausted, `RecordReader` would continue to re-read the last record on subsequent calls to `read_records`. This would only occur if it had returned exactly `batch_size` records on reaching the end of a chunk. The re-read wouldn't actually read any new data, so the reader would end up returning fewer records than it claimed to have read.

So `read_records` would do the following:

* Read 8 records
* Return the corresponding values to the caller
* Read 1 phantom record
* Read 7 records
* Return these to the caller, claiming 8 records when there were actually only 7
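The sequence above can be reproduced with a toy model. This is only a sketch of the symptom, not the arrow-rs code: `ToyReader`, its fields, and `simulate` are all hypothetical names, and the `buggy` flag simply re-counts the last record after a call that filled the batch exactly at a chunk end, without modelling why the real reader did so.

```rust
/// Toy stand-in for a record reader. Every name here is hypothetical and
/// nothing below is taken from the arrow-rs implementation.
struct ToyReader {
    chunks: Vec<Vec<i32>>,   // one value per record, grouped into chunks
    chunk: usize,            // index of the current chunk
    offset: usize,           // next unread record within the current chunk
    stale_last_record: bool, // previous call filled its batch exactly at a chunk end
    buggy: bool,             // assumption: toggles the phantom re-count
}

impl ToyReader {
    fn new(chunks: Vec<Vec<i32>>, buggy: bool) -> Self {
        Self { chunks, chunk: 0, offset: 0, stale_last_record: false, buggy }
    }

    /// Appends up to `batch_size` values to `out` and returns the number of
    /// records the reader *claims* to have read.
    fn read_records(&mut self, batch_size: usize, out: &mut Vec<i32>) -> usize {
        let mut claimed = 0;
        // Buggy path: the last record of the exhausted chunk is counted
        // again, although no new value is produced for it.
        if self.buggy && self.stale_last_record {
            claimed += 1;
        }
        self.stale_last_record = false;

        while claimed < batch_size && self.chunk < self.chunks.len() {
            let current = &self.chunks[self.chunk];
            if self.offset == current.len() {
                // Current chunk exhausted: move on to the next one.
                self.chunk += 1;
                self.offset = 0;
                continue;
            }
            out.push(current[self.offset]);
            self.offset += 1;
            claimed += 1;
        }

        // Remember whether this call filled the batch exactly at a chunk end.
        self.stale_last_record = claimed == batch_size
            && self.chunk < self.chunks.len()
            && self.offset == self.chunks[self.chunk].len();
        claimed
    }
}

/// Makes two calls with `batch_size = 8` over two 8-record chunks and returns
/// `(claimed, actual)` for each call.
fn simulate(buggy: bool) -> Vec<(usize, usize)> {
    let chunks = vec![(0..8).collect::<Vec<i32>>(), (8..16).collect()];
    let mut reader = ToyReader::new(chunks, buggy);
    (0..2)
        .map(|_| {
            let mut out = Vec::new();
            let claimed = reader.read_records(8, &mut out);
            (claimed, out.len())
        })
        .collect()
}
```

In this model `simulate(true)` gives a second call that claims 8 records but yields only 7 values, while `simulate(false)` keeps the claimed and actual counts equal on both calls.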
