Wes McKinney created ARROW-645:
----------------------------------
Summary: [Format] Mitigating the cost of random access in "wide"
record batches
Key: ARROW-645
URL: https://issues.apache.org/jira/browse/ARROW-645
Project: Apache Arrow
Issue Type: New Feature
Components: Format
Reporter: Wes McKinney
In very large schemas, due of the way we are flattening the field and buffer
metadata in the RecordBatch:
https://github.com/apache/arrow/blob/master/format/Message.fbs#L271
The cost to reconstruct / load a single array from a RecordBatch can be
arbitrarily high.
As an example, let's consider a schema:
{code}
f0: int32
f1: string
... omitting 999996 duplicate
f999998: int32
f999999: string
{code}
Here, a record batch has 1 million fields, and in total 1.5 million buffers.
The problem with this is: to select a single field out of a record batch, we
have to inspect all types leading up to the field of interest to know how many
{{FieldNode}} and {{Buffer}} metadata values will have occurred in the
serialized metadata before that field's metadata appears.
Solving this is a little bit tricky. One way would be to add optional "field
position" and "buffer position" attributes to the {{Field}} table:
https://github.com/apache/arrow/blob/master/format/Message.fbs#L188
So here, we would know that for the {{f1}} field, the field index is 1 and the
buffer index is 2. Because a string has 3 buffers associated with it, we would
know to select buffers in slots 2, 3, 4 to reconstruct the vector container.
Let me know if the problem is not clear, and any other ideas about solutions
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)