westonpace opened a new pull request #12147:
URL: https://github.com/apache/arrow/pull/12147


   Since only partition keys were selected, we ended up reading 0 columns from 
the Parquet file. (We still need to perform this read so we can determine the 
row group sizes, or at least the total number of rows in each file, in order to 
accurately reflect the files.)
   
   We recently added behavior to the Parquet reader to respect a batch size 
parameter: if a row group is larger than the batch size, we chop the table up 
into smaller batches using a TableBatchReader with a max chunksize.  The 
TableBatchReader had a bug: if the table had no columns and the max chunksize 
was smaller than the table's row count (and did not divide it evenly), we would 
hit an infinite loop.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
