[ 
https://issues.apache.org/jira/browse/ARROW-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080545#comment-17080545
 ] 

Radu Teodorescu commented on ARROW-8391:
----------------------------------------

Would you consider allowing for an indexing structure stored in a contiguous 
section that contains records like <record batch (id?), offset>?

This structure can itself be represented using Arrow constructs (i.e., as an 
Arrow table).
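
To make the idea concrete, here is a minimal sketch of such a flat index, written in plain Python rather than against the Arrow API. It assumes the "offset" in each record is the row number of the first row of the batch (a byte offset into the file could sit alongside it as another column). The column names, the sample values and the function name are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical flat index, one entry per record batch, held column-wise
# the way an Arrow table would hold it. All values are made up.
batch_ids = [0, 1, 2, 3]
first_rows = [0, 100, 250, 400]  # assumed: row number of the first row of each batch
total_rows = 500


def batches_for_row_range(start, stop):
    """Return the ids of the batches overlapping the half-open row range [start, stop)."""
    if not 0 <= start < stop <= total_rows:
        raise ValueError("row range out of bounds")
    # Binary search for the batch holding the first and the last requested row.
    first = bisect_right(first_rows, start) - 1
    last = bisect_right(first_rows, stop - 1) - 1
    return batch_ids[first:last + 1]
```

With these sample values, `batches_for_row_range(120, 260)` selects batches 1 and 2, so only those two would need to be read from the file.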

If we want to allow for a very large number of record 
batches, to the point where the index itself imposes significant 
communication overhead, I would suggest breaking the index into a hierarchical 
structure, somewhat similar to a B-tree:
 * Each node represents a fixed-size <batch id, offset> table and is itself 
stored as a record batch in the index table.
 * Leaf nodes point to record batches in the original table
 * Internal nodes point to other index record batches

The first index table record batch represents the root of the hierarchy.

This way, getting the list of record batches involved in a row range involves 
reading about _log B(record batch count)_ index batches (a logarithm to base 
B), where B is the number of entries per index record batch.
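
The hierarchy described above can be sketched roughly as follows, again in plain Python rather than against the Arrow API. Each tuple in `nodes` stands in for one index record batch, and node 0 is the root. The sketch assumes the stored offset is the first row number covered by an entry. The names `build_index`, `find_batch` and `batches_in_range` are hypothetical.

```python
from bisect import bisect_right


def build_index(first_rows, fanout):
    """Build the index nodes; node 0 is the root.

    Each node is (is_leaf, keys, pointers). Leaf pointers are record batch ids
    in the data file; internal pointers are positions of other index nodes.
    """
    count = len(first_rows)
    level = [(True, first_rows[i:i + fanout], list(range(i, min(i + fanout, count))))
             for i in range(0, count, fanout)]
    levels = [level]
    while len(level) > 1:
        # Pointers are positions within the level below; made absolute further down.
        level = [(False,
                  [child[1][0] for child in level[i:i + fanout]],
                  list(range(i, min(i + fanout, len(level)))))
                 for i in range(0, len(level), fanout)]
        levels.append(level)
    levels.reverse()  # root level first, so the root lands at position 0
    nodes, base = [], 0
    for lvl in levels:
        next_base = base + len(lvl)  # absolute position where the next level starts
        for is_leaf, keys, ptrs in lvl:
            if not is_leaf:
                ptrs = [next_base + p for p in ptrs]
            nodes.append((is_leaf, keys, ptrs))
        base = next_base
    return nodes


def find_batch(nodes, row):
    """Return (batch_id, index_nodes_read) for the batch containing `row`."""
    node_id, reads = 0, 0
    while True:
        is_leaf, keys, ptrs = nodes[node_id]
        reads += 1
        slot = max(bisect_right(keys, row) - 1, 0)
        if is_leaf:
            return ptrs[slot], reads
        node_id = ptrs[slot]


def batches_in_range(nodes, start, stop):
    """Ids of the batches overlapping the half-open row range [start, stop)."""
    first, _ = find_batch(nodes, start)
    last, _ = find_batch(nodes, stop - 1)
    return list(range(first, last + 1))
```

For example, with 1000 batches of 100 rows each and a fan-out of 10, `build_index([i * 100 for i in range(1000)], 10)` produces 111 index nodes in three levels, and `find_batch` reads one node per level to locate a row.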

 

> [C++] Implement row range read API for IPC file (and Feather)
> -------------------------------------------------------------
>
>                 Key: ARROW-8391
>                 URL: https://issues.apache.org/jira/browse/ARROW-8391
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>
> The objective would be to be able to read a range of rows from the middle of a 
> file. It's not as easy as it might sound since all the record batch metadata 
> must be examined to determine the start and end point of the row range.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
