wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-659143356


   So it appears there were changes to the underlying implementation of 
RecordBatchReader. Prior to these changes, it would yield record batches of 
exactly the requested batch size (when possible). So for a `batch_size` of 900 
on a file written with a `chunk_size` of 1,000, it would yield batches of 900, 
900, 900, 900, ... rows. Now it yields slices aligned with the row groups, so 
the same parameters yield batches with row counts of 900, 100, 900, 100, and 
so on.
   
   I'm not 100% sure whether we care about the exact number of rows returned, 
but for now I'm leaning towards yes. Open to feedback on that. In the meantime, 
I will soon push changes that stitch the batches together to yield consistent 
row counts.
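   To illustrate the stitching idea, here is a minimal sketch in plain Python 
(lists of rows standing in for Arrow record batches; the function name 
`rebatch` is hypothetical, not part of the PR): leftover rows from one 
row-group-aligned slice are carried over and combined with the next, so every 
yielded batch except possibly the last has exactly `batch_size` rows.

   ```python
   def rebatch(batches, batch_size):
       """Re-chunk a stream of variable-size batches into fixed-size batches.

       Rows left over from one input batch are buffered and stitched onto
       the next, so output sizes are batch_size, batch_size, ..., remainder.
       """
       buffer = []
       for batch in batches:
           buffer.extend(batch)
           # Drain the buffer in full batch_size chunks.
           while len(buffer) >= batch_size:
               yield buffer[:batch_size]
               buffer = buffer[batch_size:]
       if buffer:  # final partial batch, if any rows remain
           yield buffer

   # A 4,000-row file written with chunk_size=1000, read with batch_size=900:
   row_groups = [list(range(i, i + 1000)) for i in range(0, 4000, 1000)]
   sizes = [len(b) for b in rebatch(row_groups, 900)]
   # sizes == [900, 900, 900, 900, 400] rather than 900, 100, 900, 100, ...
   ```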
   
   A cool side effect of these changes is that it gets around the bug I 
mentioned earlier that would have blocked support for categorical columns in 
this method.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
