[
https://issues.apache.org/jira/browse/ARROW-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480328#comment-17480328
]
Weston Pace commented on ARROW-15413:
-------------------------------------
[~lidavidm] We discussed this once before I think. Can I get a quick sanity
check that this should indeed be possible?
> [C++][Datasets] Investigate sub-batch IPC reads
> -----------------------------------------------
>
> Key: ARROW-15413
> URL: https://issues.apache.org/jira/browse/ARROW-15413
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> When scanning an IPC file, the finest resolution we can currently read is a
> record batch. Often we process relatively small slices of that batch in an
> iterative fashion, which means we sometimes have to read in and hold a huge
> batch in memory while we slice off small pieces of it.
> For example, if a user creates an IPC file containing a single record batch
> of 50 million rows and we want to process it in batches of 64K rows, we must
> first read all 50 million rows into memory and only then slice off the 64K
> sub-batches.
> We should be able to create a sub-batch reader (although this will become
> more complicated in the future with things like RLE columns) that reads small
> pieces of the batch directly from disk instead of reading the entire batch
> into memory first.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)