[jira] [Updated] (ARROW-645) [Format] Mitigating the cost of random access in "wide" record batches

Wes McKinney (JIRA) Thu, 16 Mar 2017 21:05:02 -0700

     [ 
https://issues.apache.org/jira/browse/ARROW-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wes McKinney updated ARROW-645:
-------------------------------
    Description: 
In very large schemas, due of the way we are flattening the field and buffer 
metadata in the RecordBatch:

https://github.com/apache/arrow/blob/master/format/Message.fbs#L271

The cost to reconstruct / load a single array from a RecordBatch can be 
arbitrarily high. 

As an example, let's consider a schema:

{code}
f0: int32
f1: string
...  omitting 999996 duplicate
f999998: int32
f999999: string
{code}

Here, a record batch has 1 million fields, and in total 2.5 million buffers. 
The problem with this is: to select a single field out of a record batch, we 
have to inspect all types leading up to the field of interest to know how many 
{{FieldNode}} and {{Buffer}} metadata values will have occurred in the 
serialized metadata before that field's metadata appears.

Solving this is a little bit tricky. One way would be to add optional "field 
position" and "buffer position" attributes to the {{Field}} table:

https://github.com/apache/arrow/blob/master/format/Message.fbs#L188

So here, we would know that for the {{f1}} field, the field index is 1 and the 
buffer index is 2. Because a string has 3 buffers associated with it, we would 
know to select buffers in slots 2, 3, 4 to reconstruct the vector container. 

Let me know if the problem is not clear, and any other ideas about solutions

  was:
In very large schemas, due of the way we are flattening the field and buffer 
metadata in the RecordBatch:

https://github.com/apache/arrow/blob/master/format/Message.fbs#L271

The cost to reconstruct / load a single array from a RecordBatch can be 
arbitrarily high. 

As an example, let's consider a schema:

{code}
f0: int32
f1: string
...  omitting 999996 duplicate
f999998: int32
f999999: string
{code}

Here, a record batch has 1 million fields, and in total 1.5 million buffers. 
The problem with this is: to select a single field out of a record batch, we 
have to inspect all types leading up to the field of interest to know how many 
{{FieldNode}} and {{Buffer}} metadata values will have occurred in the 
serialized metadata before that field's metadata appears.

Solving this is a little bit tricky. One way would be to add optional "field 
position" and "buffer position" attributes to the {{Field}} table:

https://github.com/apache/arrow/blob/master/format/Message.fbs#L188

So here, we would know that for the {{f1}} field, the field index is 1 and the 
buffer index is 2. Because a string has 3 buffers associated with it, we would 
know to select buffers in slots 2, 3, 4 to reconstruct the vector container. 

Let me know if the problem is not clear, and any other ideas about solutions


> [Format] Mitigating the cost of random access in "wide" record batches
> ----------------------------------------------------------------------
>
>                 Key: ARROW-645
>                 URL: https://issues.apache.org/jira/browse/ARROW-645
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Format
>            Reporter: Wes McKinney
>
> In very large schemas, due of the way we are flattening the field and buffer 
> metadata in the RecordBatch:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L271
> The cost to reconstruct / load a single array from a RecordBatch can be 
> arbitrarily high. 
> As an example, let's consider a schema:
> {code}
> f0: int32
> f1: string
> ...  omitting 999996 duplicate
> f999998: int32
> f999999: string
> {code}
> Here, a record batch has 1 million fields, and in total 2.5 million buffers. 
> The problem with this is: to select a single field out of a record batch, we 
> have to inspect all types leading up to the field of interest to know how 
> many {{FieldNode}} and {{Buffer}} metadata values will have occurred in the 
> serialized metadata before that field's metadata appears.
> Solving this is a little bit tricky. One way would be to add optional "field 
> position" and "buffer position" attributes to the {{Field}} table:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L188
> So here, we would know that for the {{f1}} field, the field index is 1 and 
> the buffer index is 2. Because a string has 3 buffers associated with it, we 
> would know to select buffers in slots 2, 3, 4 to reconstruct the vector 
> container. 
> Let me know if the problem is not clear, and any other ideas about solutions



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (ARROW-645) [Format] Mitigating the cost of random access in "wide" record batches

Reply via email to