[ 
https://issues.apache.org/jira/browse/ARROW-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685423#comment-15685423
 ] 

Wes McKinney edited comment on ARROW-384 at 11/22/16 2:08 AM:
--------------------------------------------------------------

This seems reasonable, and saves you from always requiring the metadata size. 

If we look at the flatbuffers IDL for the RecordBatch and Buffer metadata 
(https://github.com/apache/arrow/blob/master/format/Message.fbs#L210), it says 
the offset is "The relative offset into the shared memory page where the bytes 
for this buffer starts". This is somewhat rigid because it means that file-like 
record batches are not easily relocatable -- by that definition, the offset 
would need to be the position relative to the start of the file, not relative 
to the start of the record batch. This is what the C++ implementation is doing 
now. 

Here's my idea: add an enum flag to RecordBatch that indicates whether the 
buffer offsets are absolute (relative to the start of the file or shared memory 
block) or relative to a contiguous blob of bytes (what the Java file 
implementation is doing now). The latter is not necessarily good for shared 
memory because it presumes contiguity, but it makes record batches relocatable 
when they are contiguous (e.g. in a file-like setting).
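
To make the proposal concrete, here is a minimal sketch of how a reader might 
resolve a buffer under each mode. The names OffsetMode, BufferMeta, 
resolve_buffer and body_start are hypothetical and chosen only for 
illustration; nothing like them exists in Message.fbs today.

```python
from dataclasses import dataclass
from enum import Enum


class OffsetMode(Enum):
    # Hypothetical flag values; not part of the current Arrow metadata.
    ABSOLUTE = 0  # offset counts from the start of the file / shared memory block
    RELATIVE = 1  # offset counts from the start of the record batch body


@dataclass
class BufferMeta:
    offset: int
    length: int


def resolve_buffer(source: bytes, meta: BufferMeta, mode: OffsetMode,
                   body_start: int = 0) -> bytes:
    """Return the bytes of one buffer, honoring the offset mode."""
    # In absolute mode the batch position is irrelevant; in relative mode
    # the reader must know where the batch body begins in `source`.
    base = 0 if mode is OffsetMode.ABSOLUTE else body_start
    start = base + meta.offset
    return source[start:start + meta.length]
```

With a 6-byte prefix in front of the batch body, an absolute offset of 8 and a 
relative offset of 2 (with body_start=6) would address the same bytes.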

Relocatable record batch metadata / fully relative offsets are also better for 
RPC / socket-based exchange (which is effectively the same as sending a segment 
of the current "file format"), so that's another argument for adding that as an 
option.
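
A toy illustration of the relocation point, under the assumption that a record 
batch is a contiguous run of bytes cut out of a larger file and sent as-is. The 
byte strings and the read helper are made up for the example.

```python
def read(source: bytes, base: int, offset: int, length: int) -> bytes:
    """Read `length` bytes at `base + offset` from `source`."""
    start = base + offset
    return source[start:start + length]


file_bytes = b"MAGIC!" + b"abcdef"  # batch body starts at byte 6 of the "file"
batch_start = 6
segment = file_bytes[batch_start:]  # the piece that would go over a socket

# A relative offset (2) stays valid wherever the batch lands.
in_file = read(file_bytes, batch_start, 2, 2)
on_wire = read(segment, 0, 2, 2)

# An absolute offset (8) was only meaningful inside the original file.
stale = read(segment, 0, 8, 2)
```

Here in_file and on_wire refer to the same two bytes, while stale does not, so 
a sender using absolute offsets would have to rewrite the metadata before 
shipping the segment.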

I don't think I can make an argument that either absolute (needed for general 
shared memory IPC) or relative (better for file / RPC) offsets should be the 
only option available to the exclusion of the other. 


> Align Java and C++ RecordBatch data and metadata layout
> -------------------------------------------------------
>
>                 Key: ARROW-384
>                 URL: https://issues.apache.org/jira/browse/ARROW-384
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Julien Le Dem
>
> layout on C++ side:
> {noformat}
> <buffers> <metadata> <metadata size: int32>
> {noformat}
> and on the Java side:
> {noformat}
> <metadata> <buffers>
> {noformat}
> In the file format the footer has a Block info that contains the metadata 
> length.
> https://github.com/apache/arrow/blob/f082b17323354dc2af31f39c15c58b995ba08360/format/File.fbs#L36
> See:
> https://github.com/apache/arrow/pull/211#issuecomment-262080545



