[
https://issues.apache.org/jira/browse/ARROW-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480328#comment-17480328
]
Weston Pace commented on ARROW-15413:
-------------------------------------
[~lidavidm] We discussed this once before I think. Can I get a quick sanity
check that this should indeed be possible?
> [C++][Datasets] Investigate sub-batch IPC reads
> -----------------------------------------------
>
> Key: ARROW-15413
> URL: https://issues.apache.org/jira/browse/ARROW-15413
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> When scanning an IPC file, the finest resolution we can currently read is a
> record batch. Often we process relatively small slices of that batch in an
> iterative fashion, which means we sometimes have to read in and hold a huge
> batch in memory while we slice off small pieces of it.
> For example, if a user creates an IPC file containing a single record batch
> of 50 million rows and we want to process it in batches of 64K rows, we must
> first read all 50 million rows into memory and only then slice off the 64K
> sub-batches.
> We should be able to create a sub-batch reader (although this will become
> more complicated in the future with things like RLE columns) that reads small
> pieces of the batch directly from disk instead of reading the entire batch
> into memory first.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)