[jira] [Created] (ARROW-15413) [C++][Datasets] Investigate sub-batch IPC reads

Weston Pace (Jira) Fri, 21 Jan 2022 18:23:13 -0800

Weston Pace created ARROW-15413:
-----------------------------------

             Summary: [C++][Datasets] Investigate sub-batch IPC reads
                 Key: ARROW-15413
                 URL: https://issues.apache.org/jira/browse/ARROW-15413
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: C++
            Reporter: Weston Pace



When scanning an IPC file, the finest granularity we can currently read is a 
record batch.  We often process that batch iteratively, as a series of 
relatively small slices.  This means we sometimes have to read an entire large 
batch into memory and hold it there while we slice off small pieces of it.

For example, if a user creates an IPC file containing a single record batch of 
50 million rows and we want to process it in batches of 64K rows, we first have 
to read all 50 million rows into memory and then slice off the 64K-row 
sub-batches.
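
To put rough numbers on that example, here is a small back-of-the-envelope 
sketch.  It assumes a single fixed-width 8-byte column (e.g. int64) and takes 
"64K" to mean 65,536 rows; neither is stated above, and the helper names are 
hypothetical, not part of Arrow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Arrow API): bytes needed to hold `rows` values
// of a fixed-width column whose elements are `byte_width` bytes each.
inline std::int64_t ColumnBytes(std::int64_t rows, std::int64_t byte_width) {
  assert(rows >= 0 && byte_width > 0);
  return rows * byte_width;
}

// Hypothetical helper: number of sub-batches of `batch_rows` rows needed to
// cover `rows` rows (ceiling division).
inline std::int64_t NumSubBatches(std::int64_t rows, std::int64_t batch_rows) {
  assert(rows >= 0 && batch_rows > 0);
  return (rows + batch_rows - 1) / batch_rows;
}

// Hypothetical helper: rows in the final, possibly short, sub-batch.
inline std::int64_t LastSubBatchRows(std::int64_t rows,
                                     std::int64_t batch_rows) {
  assert(rows > 0 && batch_rows > 0);
  return rows - (NumSubBatches(rows, batch_rows) - 1) * batch_rows;
}
```

Under those assumptions, ColumnBytes(50000000, 8) is 400,000,000 bytes for the 
full batch versus ColumnBytes(65536, 8) = 524,288 bytes for one sub-batch, and 
NumSubBatches(50000000, 65536) is 763, so we currently hold roughly 763 times 
more of that column in memory than a single step needs.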

We should be able to create a sub-batch reader that slices small pieces of the 
batch directly off the disk instead of first reading the entire batch into 
memory.  This will become more complicated in the future with things like RLE 
columns.
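
As a rough sketch of what such a reader would need to compute, the snippet 
below maps a row range onto byte ranges for one uncompressed fixed-width 
column.  This is only an illustration, not a proposed implementation: 
ByteRange, BitmapRange, ValuesRange and ValidityRange are hypothetical names, 
and buffer_offset is assumed to be the absolute file position of the buffer, 
which the caller would have to derive from the record batch metadata.  It also 
assumes the buffers are not compressed, and it does not cover variable-length 
types, where the offsets buffer would have to be read first to locate the 
data.

```cpp
#include <cassert>
#include <cstdint>

// Half-open byte range [offset, offset + length) within the file.
struct ByteRange {
  std::int64_t offset;
  std::int64_t length;
};

// Byte range of a validity bitmap slice plus the bit position of the first
// requested row inside the first byte read.
struct BitmapRange {
  ByteRange bytes;
  std::int64_t bit_offset;
};

// Hypothetical helper (not an Arrow API): file range that holds rows
// [row_offset, row_offset + num_rows) of an uncompressed fixed-width values
// buffer that starts at `buffer_offset`.
inline ByteRange ValuesRange(std::int64_t buffer_offset,
                             std::int64_t byte_width,
                             std::int64_t row_offset, std::int64_t num_rows) {
  assert(buffer_offset >= 0 && byte_width > 0);
  assert(row_offset >= 0 && num_rows >= 0);
  return {buffer_offset + row_offset * byte_width, num_rows * byte_width};
}

// Hypothetical helper: the validity bitmap stores one bit per row, so the
// range is widened to whole bytes and the caller must skip `bit_offset` bits
// in the first byte.
inline BitmapRange ValidityRange(std::int64_t buffer_offset,
                                 std::int64_t row_offset,
                                 std::int64_t num_rows) {
  assert(buffer_offset >= 0 && row_offset >= 0 && num_rows >= 0);
  const std::int64_t first_byte = row_offset / 8;
  const std::int64_t end_byte = (row_offset + num_rows + 7) / 8;
  return {{buffer_offset + first_byte, end_byte - first_byte}, row_offset % 8};
}
```

For instance, with an 8-byte column whose values buffer starts at file offset 
1024, the second 65,536-row sub-batch would need only the 524,288 bytes 
starting at offset 525,312.  The validity slice generally does not start on a 
byte boundary, which is why the bit offset has to be carried along with the 
byte range.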



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
