Weston Pace created ARROW-14024:
-----------------------------------

             Summary: [C++] ScanOptions::batch_size not respected in Parquet/IPC readers
                 Key: ARROW-14024
                 URL: https://issues.apache.org/jira/browse/ARROW-14024
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


At first glance, the Parquet reader looks like it should respect the setting: 
the ScanOptions::batch_size property is forwarded into the 
ArrowReaderProperties for the parquet::arrow::FileReader.  However, we then 
call ReadOneRowGroup, which ignores the batch_size option.

The IPC reader simply doesn't look at the property at all.

Even if we can't control the source read size (e.g. we have to read a full row 
group / record batch and have no control over its size) we can still split 
whatever we read into smaller batches that respect the batch size.  This is 
important for achieving parallelism as we can then partition the CPU work 
across these batches.
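
As a rough illustration of the splitting step described above, the sketch 
below computes the (offset, length) ranges that would cut one oversized batch 
into pieces of at most batch_size rows.  SliceRanges is a hypothetical helper 
name used only for this example; it is not an existing Arrow function and not 
necessarily how the scanner would implement this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Arrow): given the number of rows in a
// batch that was read from the source, return the (offset, length) pairs
// that cover it in chunks of at most `batch_size` rows.
std::vector<std::pair<int64_t, int64_t>> SliceRanges(int64_t num_rows,
                                                     int64_t batch_size) {
  assert(batch_size > 0);  // a non-positive batch size would never advance
  std::vector<std::pair<int64_t, int64_t>> ranges;
  for (int64_t offset = 0; offset < num_rows; offset += batch_size) {
    // The last chunk may be shorter than batch_size.
    ranges.emplace_back(offset, std::min(batch_size, num_rows - offset));
  }
  return ranges;
}
```

With Arrow, each pair could presumably be passed to 
RecordBatch::Slice(offset, length), which returns a zero-copy view of the 
original batch, so the resulting smaller batches can be handed to separate 
CPU tasks without copying the data.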



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
