[ https://issues.apache.org/jira/browse/ARROW-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080545#comment-17080545 ]
Radu Teodorescu commented on ARROW-8391:
----------------------------------------
Would you consider allowing for an index structure, stored in a contiguous
section, that contains records like <record batch (id?), offset>?
This structure can itself be represented using Arrow constructs (i.e., as an
Arrow table).
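
To make the flat index idea concrete, here is a minimal sketch in plain
Python. Parallel lists stand in for the columns of an Arrow table. The
first_row column, the function names, and the example offsets are all
assumptions for illustration; the proposal above only names
<batch id, offset>, and some cumulative row count is needed to map rows to
batches.

```python
from bisect import bisect_right


def build_index(batches):
    """Build a columnar index from (file_offset, num_rows) pairs in file order.

    Hypothetical layout: one entry per record batch, with an extra
    first_row column (assumption) holding the global row number at
    which the batch starts.
    """
    index = {"batch_id": [], "offset": [], "first_row": []}
    row = 0
    for batch_id, (offset, num_rows) in enumerate(batches):
        index["batch_id"].append(batch_id)
        index["offset"].append(offset)
        index["first_row"].append(row)
        row += num_rows
    index["total_rows"] = row
    return index


def batches_for_rows(index, start, stop):
    """Return (batch_id, offset) for every batch overlapping rows [start, stop)."""
    start = max(start, 0)
    stop = min(stop, index["total_rows"])
    if start >= stop:
        return []
    first_rows = index["first_row"]
    # Last batch whose first_row <= start, and last batch whose first_row <= stop - 1.
    lo = bisect_right(first_rows, start) - 1
    hi = bisect_right(first_rows, stop - 1) - 1
    return [(index["batch_id"][i], index["offset"][i]) for i in range(lo, hi + 1)]


# Three batches of 10, 5, and 20 rows at made-up file offsets.
example_index = build_index([(100, 10), (500, 5), (900, 20)])
```

With this layout, a row range resolves to its batches with two binary
searches over the first_row column, without touching any batch metadata
in the body of the file.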
If we want to provision for an overwhelmingly large number of record
batches, to the point where the index itself incurs significant
communication overhead, I would suggest breaking the index into a hierarchical
structure, somewhat similar to a B-tree:
* Each node represents a fixed-size <batch id, offset> table and is itself
stored as a record batch in the index table.
* Leaf nodes point to record batches in the original table
* Internal nodes point to other index record batches
The first index table record batch represents the root of the hierarchy.
This way, getting the list of record batches involved in a row range involves
reading O(log_B(record batch count)) index batches, where the base B is the
number of entries per index record batch.
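
The hierarchy described above can be sketched as follows, again in plain
Python with dicts standing in for index record batches. The node layout
(keys holding first-row numbers, ptrs holding either child node ids or
batch payloads), the function names, and the example data are assumptions
for illustration, not an existing Arrow API.

```python
from bisect import bisect_right


def _chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_tree(leaf_entries, fanout):
    """Build the index bottom-up from a non-empty, sorted list of
    (first_row, payload) pairs. Returns a list of nodes; nodes[0] is the root."""
    def make(level_entries, leaf):
        return [{"leaf": leaf,
                 "keys": [k for k, _ in chunk],
                 "ptrs": [p for _, p in chunk]}
                for chunk in _chunks(level_entries, fanout)]

    level = make(leaf_entries, leaf=True)
    levels = [level]
    while len(level) > 1:
        # Each parent entry is (smallest key of child, child position in its level).
        parents = [(node["keys"][0], i) for i, node in enumerate(level)]
        level = make(parents, leaf=False)
        levels.append(level)

    # Lay the levels out root first and turn per-level positions into node ids.
    levels.reverse()
    nodes, start = [], 0
    for level in levels:
        next_start = start + len(level)
        for node in level:
            if not node["leaf"]:
                node["ptrs"] = [next_start + i for i in node["ptrs"]]
            nodes.append(node)
        start = next_start
    return nodes


def find_batch(nodes, row):
    """Return (payload, index_nodes_read) for the batch containing `row`."""
    node, reads = nodes[0], 1
    while not node["leaf"]:
        i = max(bisect_right(node["keys"], row) - 1, 0)
        node, reads = nodes[node["ptrs"][i]], reads + 1
    i = max(bisect_right(node["keys"], row) - 1, 0)
    return node["ptrs"][i], reads


# Ten batches of 100 rows each; the payload is a hypothetical (batch id, offset).
entries = [(i * 100, (i, 1000 + 64 * i)) for i in range(10)]
tree = build_tree(entries, fanout=3)
```

A row range [start, stop) would need one descent for start and one for
stop - 1 (or a scan across neighbouring leaves), so the number of index
batches read stays proportional to the depth of the tree.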
> [C++] Implement row range read API for IPC file (and Feather)
> -------------------------------------------------------------
>
> Key: ARROW-8391
> URL: https://issues.apache.org/jira/browse/ARROW-8391
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
>
> The objective would be to be able to read a range of rows from the middle of a
> file. It's not as easy as it might sound, since all the record batch metadata
> must be examined to determine the start and end point of the row range.