[
https://issues.apache.org/jira/browse/ARROW-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285901#comment-17285901
]
Antoine Pitrou commented on ARROW-645:
--------------------------------------
It's not obvious to me we want to add that kind of complication to the IPC
format. This threatens to turn the IPC format to a Parquet-like spec with niche
optional features that just fragment the ecosystem and makes it difficult to
predict whether two endpoints will be compatible with each other.
Do people actually have such mega-wide schemas? What is the use case for having
1e6 fields in a schema?
> [Format] Mitigating the cost of random access in "wide" record batches
> ----------------------------------------------------------------------
>
> Key: ARROW-645
> URL: https://issues.apache.org/jira/browse/ARROW-645
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Format
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 4.0.0
>
>
> In very large schemas, due of the way we are flattening the field and buffer
> metadata in the RecordBatch:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L271
> The cost to reconstruct / load a single array from a RecordBatch can be
> arbitrarily high.
> As an example, let's consider a schema:
> {code}
> f0: int32
> f1: string
> ... omitting 999996 duplicate
> f999998: int32
> f999999: string
> {code}
> Here, a record batch has 1 million fields, and in total 2.5 million buffers.
> The problem with this is: to select a single field out of a record batch, we
> have to inspect all types leading up to the field of interest to know how
> many {{FieldNode}} and {{Buffer}} metadata values will have occurred in the
> serialized metadata before that field's metadata appears.
> Solving this is a little bit tricky. One way would be to add optional "field
> position" and "buffer position" attributes to the {{Field}} table:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L188
> So here, we would know that for the {{f1}} field, the field index is 1 and
> the buffer index is 2. Because a string has 3 buffers associated with it, we
> would know to select buffers in slots 2, 3, 4 to reconstruct the vector
> container.
> Let me know if the problem is not clear, and any other ideas about solutions
--
This message was sent by Atlassian Jira
(v8.3.4#803005)