wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-659143356


   So it appears there were changes to the underlying implementation of 
RecordBatchReader. Prior to these changes, it would yield record batches of 
exactly the requested batch size (when possible). So for a `batch_size` of 900 
on a file written with a `chunk_size` of 1,000, it would yield batches of 900, 
900, 900, 900, ... rows. Now it yields slices aligned with the row groups, so 
the same parameters yield batches with row counts of 900, 100, 900, 100, and 
so on.
   
   I'm not 100% sure whether we care about the exact number of rows returned, 
but for now I'm leaning towards yes. Open to feedback on that. In the meantime, 
I will soon push changes that stitch the batches together to yield consistent 
row counts.
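   To illustrate the stitching idea, here is a minimal sketch in plain Python 
(lists of rows standing in for Arrow record batches; the function name 
`rebatch` is hypothetical, not part of the PR): leftover rows from one 
row-group-aligned slice are carried over and combined with the next, so every 
yielded batch except possibly the last has exactly `batch_size` rows.

   ```python
   def rebatch(batches, batch_size):
       """Re-chunk a stream of variable-size batches into fixed-size batches.

       Rows left over from one input batch are buffered and stitched onto
       the next, so output sizes are batch_size, batch_size, ..., remainder.
       """
       buffer = []
       for batch in batches:
           buffer.extend(batch)
           # Drain the buffer in full batch_size chunks.
           while len(buffer) >= batch_size:
               yield buffer[:batch_size]
               buffer = buffer[batch_size:]
       if buffer:  # final partial batch, if any rows remain
           yield buffer

   # A 4,000-row file written with chunk_size=1000, read with batch_size=900:
   row_groups = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]
   sizes = [len(b) for b in rebatch(row_groups, 900)]
   # sizes == [900, 900, 900, 900, 400] rather than 900, 100, 900, 100, ...
   ```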
   
   A cool side effect of these changes is that it gets around the bug I 
mentioned earlier that would have blocked support for categorical columns in 
this method.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
