[ https://issues.apache.org/jira/browse/ARROW-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481353#comment-17481353 ]

Weston Pace commented on ARROW-15413:
-------------------------------------

Ah, good point.  I suppose we could always read those columns fully into memory 
but page out the other columns, though that gets tricky to reason about rather 
quickly.  We could also read the offsets buffer first, then read the data 
buffer next.  This would slow us down in some cases, but it might be preferable 
to slow down rather than run out of RAM.

> [C++][Datasets] Investigate sub-batch IPC reads
> -----------------------------------------------
>
>                 Key: ARROW-15413
>                 URL: https://issues.apache.org/jira/browse/ARROW-15413
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> When scanning an IPC file, the finest resolution we can currently read is a 
> record batch.  Often we are processing relatively small slices of that batch 
> in an iterative fashion.  This means we sometimes have to read in and hold a 
> huge batch in memory while we slice off small pieces of it.
> For example, if a user creates an IPC file with 1 record batch with 50 
> million rows and we want to process it in batches of 64K rows we have to 
> first read the entire 50 million rows in memory and then slice off the 64K 
> sub-batches.
> We should be able to create a sub-batch reader (although this will be more 
> complicated in the future with things like RLE columns) which can slice small 
> pieces of the batch off the disk instead of reading the entire batch into 
> memory first.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
