[jira] [Commented] (ARROW-645) [Format] Mitigating the cost of random access in "wide" record batches

Antoine Pitrou (Jira) Wed, 17 Feb 2021 07:52:13 -0800


    [ https://issues.apache.org/jira/browse/ARROW-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285901#comment-17285901 ]


Antoine Pitrou commented on ARROW-645:
--------------------------------------

It's not obvious to me that we want to add that kind of complication to the IPC 
format. This threatens to turn the IPC format into a Parquet-like spec with niche 
optional features that just fragment the ecosystem and make it difficult to 
predict whether two endpoints will be compatible with each other.

Do people actually have such mega-wide schemas? What is the use case for having 
1e6 fields in a schema?

> [Format] Mitigating the cost of random access in "wide" record batches
> ----------------------------------------------------------------------
>
>                 Key: ARROW-645
>                 URL: https://issues.apache.org/jira/browse/ARROW-645
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 4.0.0
>
>
> In very large schemas, because of the way we flatten the field and buffer 
> metadata in the RecordBatch
> (https://github.com/apache/arrow/blob/master/format/Message.fbs#L271),
> the cost to reconstruct / load a single array from a RecordBatch can be 
> arbitrarily high. 
> As an example, let's consider a schema:
> {code}
> f0: int32
> f1: string
> ...  omitting 999996 similar fields
> f999998: int32
> f999999: string
> {code}
> Here, a record batch has 1 million fields, and in total 2.5 million buffers. 
> The problem with this is: to select a single field out of a record batch, we 
> have to inspect all types leading up to the field of interest to know how 
> many {{FieldNode}} and {{Buffer}} metadata values will have occurred in the 
> serialized metadata before that field's metadata appears.
> Solving this is a little bit tricky. One way would be to add optional "field 
> position" and "buffer position" attributes to the {{Field}} table:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L188
> So here, we would know that for the {{f1}} field, the field index is 1 and 
> the buffer index is 2. Because a string has 3 buffers associated with it, we 
> would know to select buffers in slots 2, 3, 4 to reconstruct the vector 
> container. 
> Let me know if the problem is not clear, or if you have any other ideas about solutions.
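
To make the cost described in the issue concrete, here is a minimal Python sketch of the two lookup strategies. It is an illustration only and does not use the Arrow libraries: `make_schema`, `locate_by_scan`, and `precompute_positions` are hypothetical helper names, and the buffer counts (2 for int32, 3 for string) are taken from the example in the issue. It also assumes a flat schema where each field contributes exactly one FieldNode; nested types would need a per-type node count as well.

```python
# Buffers per type for the two types in the example schema:
# int32 -> validity + data, string -> validity + offsets + data.
BUFFERS_PER_TYPE = {"int32": 2, "string": 3}


def make_schema(num_fields):
    """Build the alternating int32/string schema from the issue (hypothetical helper)."""
    return [
        (f"f{i}", "int32" if i % 2 == 0 else "string")
        for i in range(num_fields)
    ]


def locate_by_scan(schema, target):
    """Current situation: inspect every preceding field's type to find the
    buffer offset of the target field. Cost grows with the target index."""
    buffer_index = 0
    for field_index, (_name, type_name) in enumerate(schema):
        if field_index == target:
            return field_index, buffer_index, BUFFERS_PER_TYPE[type_name]
        buffer_index += BUFFERS_PER_TYPE[type_name]
    raise IndexError(target)


def precompute_positions(schema):
    """Proposed idea: store a (field position, buffer position) pair per field
    once, so that a later lookup is a single index operation."""
    positions = []
    buffer_index = 0
    for field_index, (_name, type_name) in enumerate(schema):
        positions.append((field_index, buffer_index))
        buffer_index += BUFFERS_PER_TYPE[type_name]
    return positions, buffer_index


schema = make_schema(1_000_000)
positions, total_buffers = precompute_positions(schema)

# Scanning to f1 walks past f0 and finds buffer slots 2, 3, 4.
field_index, first_buffer, num_buffers = locate_by_scan(schema, 1)
f1_slots = list(range(first_buffer, first_buffer + num_buffers))
```

With the positions stored alongside each field, selecting the last field reads `positions[999_999]` directly, whereas the scan has to visit all 999,999 preceding fields first. Whether the extra metadata is worth the added format complexity is the open question in this thread.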



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
