[GitHub] [arrow-rs] tustvold commented on pull request #2027: Fix record delimiting on row group boundaries (#2025)

GitBox Fri, 08 Jul 2022 11:12:18 -0700


tustvold commented on PR #2027:
URL: https://github.com/apache/arrow-rs/pull/2027#issuecomment-1179247516


   The batch_size bug was a real bug, but it would have been masked by the 
MIN_BATCH_SIZE setting of 1024. The actual cause of the failure was more 
subtle: once exhausted, RecordReader would continue to re-read the last record 
on subsequent calls to read_records. This only occurred if it had returned 
exactly batch_size records when reaching the end of a chunk. The re-read 
didn't read any new data, so the reader ended up returning fewer records than 
it claimed to have read.
   
   So `read_records` would do the following:
   
   * Read 8 records
   * Return the corresponding values to the caller
   * Read 1 phantom record
   * Read 7 records
   * Return these to the caller, claiming 8 records when there were actually 
only 7
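
The sequence above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not the arrow-rs implementation: `ToyRecordReader`, `next_batch`, `run`, and the `rereads_last` flag are hypothetical names invented for illustration, each record is a single value, and the driver moves to the next chunk whenever the reader comes up short of the requested count.

```rust
/// Toy stand-in for a record reader (hypothetical, not the arrow-rs type).
struct ToyRecordReader {
    remaining: usize,   // records left in the current chunk
    values: Vec<u32>,   // values buffered for the caller
    next_value: u32,    // next value to produce
    rereads_last: bool, // true models the bug described above
}

impl ToyRecordReader {
    fn new(rereads_last: bool) -> Self {
        Self { remaining: 0, values: Vec::new(), next_value: 0, rereads_last }
    }

    /// Points the reader at a new chunk holding `num_records` records.
    fn set_chunk(&mut self, num_records: usize) {
        self.remaining = num_records;
    }

    /// Returns the number of records the reader claims to have read.
    fn read_records(&mut self, want: usize) -> usize {
        if want == 0 {
            return 0;
        }
        if self.remaining == 0 {
            // Exhausted chunk. The buggy variant counts the last record
            // again without buffering any new value (the phantom record).
            return if self.rereads_last && self.next_value > 0 { 1 } else { 0 };
        }
        let n = want.min(self.remaining);
        for _ in 0..n {
            self.values.push(self.next_value);
            self.next_value += 1;
        }
        self.remaining -= n;
        n
    }

    /// Hands the buffered values to the caller and clears the buffer.
    fn take_values(&mut self) -> Vec<u32> {
        std::mem::take(&mut self.values)
    }
}

/// Reads one batch, moving to the next chunk whenever the reader comes up short.
/// Returns (claimed record count, values actually produced).
fn next_batch(
    reader: &mut ToyRecordReader,
    chunks: &mut impl Iterator<Item = usize>,
    batch_size: usize,
) -> (usize, Vec<u32>) {
    let mut claimed = 0;
    loop {
        claimed += reader.read_records(batch_size - claimed);
        if claimed == batch_size {
            break;
        }
        match chunks.next() {
            Some(n) => reader.set_chunk(n),
            None => break,
        }
    }
    (claimed, reader.take_values())
}

/// Runs two batches of 8 over two chunks of 8 records each and reports
/// (claimed, actual) per batch.
fn run(rereads_last: bool) -> Vec<(usize, usize)> {
    let mut reader = ToyRecordReader::new(rereads_last);
    let mut chunks = vec![8, 8].into_iter();
    (0..2)
        .map(|_| {
            let (claimed, values) = next_batch(&mut reader, &mut chunks, 8);
            (claimed, values.len())
        })
        .collect()
}
```

In this model `run(true)` yields `[(8, 8), (8, 7)]`: the second batch claims 8 records but carries only 7 values, because the exhausted reader contributed one phantom record before the next chunk was set. `run(false)` yields `[(8, 8), (8, 8)]`. If a chunk ends partway through a batch, the driver switches chunks immediately and the exhausted reader is never asked again, which is why the toy only misbehaves when a chunk ends exactly on a batch boundary.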
   


