[ https://issues.apache.org/jira/browse/ARROW-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481072#comment-17481072 ]

David Li commented on ARROW-15413:
----------------------------------

I think this is not possible except for fixed-width columns. For something like 
a string column, you would have to read the offsets first to know which slice 
of the data buffer to read. (Or you could always read the entire data buffer, 
but this presumably misses the point.)
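
To make that two-step dependency concrete, here is a minimal sketch. It is not
tied to the actual IPC reader internals: the helper name and the buffer file
positions are hypothetical, and a real reader would obtain those positions from
the IPC metadata.

    #include <cstdint>
    #include <memory>

    #include <arrow/buffer.h>
    #include <arrow/io/interfaces.h>
    #include <arrow/result.h>

    // For a string column, rows [row_start, row_start + row_count) can only be
    // located in the values buffer after the corresponding offsets are read.
    arrow::Result<std::shared_ptr<arrow::Buffer>> ReadStringRows(
        arrow::io::RandomAccessFile* file,
        int64_t offsets_pos,  // hypothetical file position of the offsets buffer
        int64_t values_pos,   // hypothetical file position of the values buffer
        int64_t row_start, int64_t row_count) {
      constexpr int64_t kOffsetWidth = sizeof(int32_t);
      // Step 1: read the (row_count + 1) int32 offsets covering the rows.
      ARROW_ASSIGN_OR_RAISE(
          auto offsets_buf,
          file->ReadAt(offsets_pos + row_start * kOffsetWidth,
                       (row_count + 1) * kOffsetWidth));
      const auto* offsets =
          reinterpret_cast<const int32_t*>(offsets_buf->data());
      // Step 2: only now do we know which slice of the values buffer to fetch.
      return file->ReadAt(values_pos + offsets[0],
                          offsets[row_count] - offsets[0]);
    }

A fixed-width column, by contrast, needs no first step: the byte range is just
row_start * byte_width, so it can be read directly.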

> [C++][Datasets] Investigate sub-batch IPC reads
> -----------------------------------------------
>
>                 Key: ARROW-15413
>                 URL: https://issues.apache.org/jira/browse/ARROW-15413
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> When scanning an IPC file, the finest resolution we can currently read is a 
> record batch.  Often we are processing relatively small slices of that batch 
> in an iterative fashion.  This means we sometimes have to read in and hold a 
> huge batch in memory while we slice off small pieces of it.
> For example, if a user creates an IPC file containing 1 record batch with 50 
> million rows and we want to process it in batches of 64K rows, we have to 
> first read the entire 50 million rows into memory and then slice off the 64K 
> sub-batches.
> We should be able to create a sub-batch reader (although this will be more 
> complicated in the future with things like RLE columns) which can slice small 
> pieces of the batch off the disk instead of reading the entire batch into 
> memory first.
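
For reference, a minimal sketch of the whole-batch-then-slice pattern described
above, using the existing Arrow C++ IPC reader; the function name and the
processing loop are illustrative only.

    #include <string>
    #include <memory>

    #include <arrow/io/file.h>
    #include <arrow/ipc/reader.h>
    #include <arrow/record_batch.h>
    #include <arrow/result.h>
    #include <arrow/status.h>

    // Current behavior: each record batch is read into memory in full before
    // any small slice of it can be handed to the consumer.
    arrow::Status ScanInSlices(const std::string& path, int64_t slice_rows) {
      ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
      ARROW_ASSIGN_OR_RAISE(auto reader,
                            arrow::ipc::RecordBatchFileReader::Open(file));
      for (int i = 0; i < reader->num_record_batches(); ++i) {
        // Materializes the entire batch (e.g. all 50 million rows) at once.
        ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
        for (int64_t off = 0; off < batch->num_rows(); off += slice_rows) {
          // Slice() is zero-copy, but the full batch is already in memory.
          std::shared_ptr<arrow::RecordBatch> slice =
              batch->Slice(off, slice_rows);
          // ... process the 64K-row slice ...
        }
      }
      return arrow::Status::OK();
    }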



--
This message was sent by Atlassian Jira
(v8.20.1#820001)