Re: Format specification document?

Kohei KaiGai Sat, 05 Jan 2019 18:44:54 -0800

Hello McKinney,

After posting my first message, I found some helpful
documentation here:
https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#example


Contrary to my expectation, the in-memory layout of a FlatBuffer is
quite different from what I had assumed.
So, let's review the Apache Arrow file binary according to that documentation...

000000  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
000010  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00

The first 8 bytes are the signature "ARROW1\0\0", and the following
4 bytes are the length of the metadata, which is not part of the
flatbuffer itself. Then, we can fetch 0x0010 (int) at 0x000c.
It indicates that the root table is at 0x000c + 0x0010 = 0x001c.
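
As a minimal sketch of the preamble decoding described above, the first
16 bytes of the dump can be unpacked with Python's struct module. The
variable names are only illustrative, and little-endian byte order is
assumed.

```python
import struct

# First 16 bytes of the dump quoted above (file offsets 0x00..0x0f).
head = bytes.fromhex("41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00")

magic = head[0:6]      # the "ARROW1" signature
padding = head[6:8]    # NUL bytes up to the 8-byte boundary

# 32-bit little-endian length of the message metadata, stored at 0x0008.
(metadata_size,) = struct.unpack_from("<i", head, 8)

# The flatbuffer starts at 0x000c with an offset to the root table,
# measured from the position of the offset itself.
(root_offset,) = struct.unpack_from("<I", head, 12)
root_table = 12 + root_offset

print(magic, metadata_size, hex(root_table))
```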

The int value at 0x001c is 0x000a. It means the vtable structure begins
at 0x001c - 0x000a = 0x0012.
0x0012  0a 00  --> vtable length = 10bytes (5 items)
0x0014  0e 00  --> table length = 14 bytes; including the negative offset (4 bytes)
0x0016  06 00  --> table 0x001c + 0x0006 is metadata version (short)
0x0018  05 00  --> table 0x001c + 0x0005 is message header (byte)
0x001a  08 00  --> table 0x001c + 0x0008 is header offset (int)
0x001c  0a 00 00 00  --> negative offset to the vtable

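So, we can see that this file uses the Apache Arrow V4 metadata format
(the short at 0x0022 is 3) and that the message header is a Schema
(the byte at 0x0021 is 1). The header table begins at
0x0024 + 0x0010 = 0x0034.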
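
The vtable walk above can be sketched as follows. read_vtable is a
hypothetical helper, not part of any library, and the enum tables are
my understanding of MetadataVersion and MessageHeader in the Arrow
.fbs files, so please treat them as assumptions.

```python
import struct

# File offsets 0x00..0x2f of the dump quoted above.
dump = bytes.fromhex(
    "41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00 "
    "00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00 "
    "00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00 "
)

# Assumed enum values (believed to match Schema.fbs / Message.fbs).
METADATA_VERSIONS = {0: "V1", 1: "V2", 2: "V3", 3: "V4"}
MESSAGE_HEADERS = {0: "NONE", 1: "Schema", 2: "DictionaryBatch", 3: "RecordBatch"}


def read_vtable(buf, table_pos):
    """Return (vtable_pos, table_size, field_offsets) for a table."""
    # A table starts with a signed 32-bit offset that is subtracted
    # from the table position to locate its vtable.
    (soffset,) = struct.unpack_from("<i", buf, table_pos)
    vtable_pos = table_pos - soffset
    vtable_size, table_size = struct.unpack_from("<HH", buf, vtable_pos)
    n_fields = (vtable_size - 4) // 2
    fields = struct.unpack_from("<%dH" % n_fields, buf, vtable_pos + 4)
    return vtable_pos, table_size, list(fields)


message = 0x1c  # root table position found from the preamble
vtable_pos, table_size, (version_off, type_off, header_off) = read_vtable(dump, message)

(version,) = struct.unpack_from("<h", dump, message + version_off)
header_type = dump[message + type_off]

# The header field holds an offset relative to its own position.
header_pos = message + header_off
(rel,) = struct.unpack_from("<I", dump, header_pos)
schema_table = header_pos + rel

print(METADATA_VERSIONS[version], MESSAGE_HEADERS[header_type], hex(schema_table))
```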

000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
000030  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00

Next, 0x0034 is the position of the current table. The int value there
is 0x000a, so its vtable is at 0x0034 - 0x000a = 0x002a.

0x002a  0a 00  --> vtable length = 10bytes (5items)
0x002c  0c 00  --> table length = 12 bytes; including the negative offset (4 bytes)
0x002e  00 00  --> Schema::endianness is default (0 = little endian)
0x0030  04 00  --> Schema::fields[]
0x0032  08 00  --> Schema::custom_metadata[]

It says Schema::fields[] begins at 0x0038 + 0x03ec = 0x0424, and also says
Schema::custom_metadata[] begins at 0x003c + 0x0004 = 0x0040.
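
The same arithmetic for the Schema table can be checked with a short
sketch. The follow() helper and the variable names are hypothetical,
chosen only for this illustration.

```python
import struct

# File offsets 0x00..0x3f of the dump quoted above.
dump = bytes.fromhex(
    "41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00 "
    "00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00 "
    "00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00 "
    "04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00 "
)


def follow(buf, pos):
    """Resolve an unsigned 32-bit offset stored at pos, relative to pos."""
    (rel,) = struct.unpack_from("<I", buf, pos)
    return pos + rel


schema = 0x34
(soffset,) = struct.unpack_from("<i", dump, schema)
vtable = schema - soffset

# vtable: its own size, the table size, then one 16-bit slot per field.
vtable_size, table_size, endian_off, fields_off, meta_off = struct.unpack_from(
    "<5H", dump, vtable
)

# A slot value of 0 means the field is absent and takes its default.
endianness_is_default = endian_off == 0

fields_vec = follow(dump, schema + fields_off)
meta_vec = follow(dump, schema + meta_off)

print(hex(vtable), table_size, hex(fields_vec), hex(meta_vec))
```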

From 0x0040:
000040  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
000050  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
000060  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
000070  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
000080  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
000090  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
0000a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
0000b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
0000c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
0000d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c

The binary from 0x0060 is a cstring ("pandas\0"), and the binary from
0x006c is also a cstring of JSON.

The int at 0x0040 holds the number of vector elements (1).
So, this metadata contains one key-value pair.
The next int word (0x000c at 0x0044) points to the sub-table at
0x0044 + 0x000c = 0x0050. Its negative offset is 0x0008, so its vtable
begins at 0x0048:
0x0048  08 00  --> vtable length = 8bytes (4items)
0x004a  0c 00  --> table length  = 12 bytes; including the negative offset (4 bytes)
0x004c  04 00  --> cstring offset (key) is at 0x0050 + 0x0004
0x004e  08 00  --> cstring offset (value) is at 0x0050 + 0x0008

The key is at 0x0054 + 0x0008 = 0x005c. The int value there is 0x0006,
which means the cstring length is 6 bytes, and the cstring body
("pandas\0") begins right after it at 0x0060.
The value is at 0x0058 + 0x0010 = 0x0068. The int value there is 0x03b4
(= 948 bytes), and the cstring body ({"pandas_version" ...) begins right
after it at 0x006c.
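
The key-value walk above can be sketched as below. Only the 64 bytes
from 0x0040 are included, so the 948-byte JSON value is truncated to
its visible prefix. BASE and the helper functions are hypothetical
names used only for this illustration.

```python
import struct

BASE = 0x40  # file offset of the first byte in `dump`
dump = bytes.fromhex(
    "01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00 "
    "08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00 "
    "70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61 "
    "6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22 "
)


def u16(pos):
    return struct.unpack_from("<H", dump, pos - BASE)[0]


def u32(pos):
    return struct.unpack_from("<I", dump, pos - BASE)[0]


def follow(pos):
    """Resolve an offset stored at file position pos, relative to pos."""
    return pos + u32(pos)


def read_string(pos):
    """Return (declared length, bytes available in this excerpt)."""
    length = u32(pos)
    start = pos + 4 - BASE
    return length, dump[start:start + length]


vec = 0x40
count = u32(vec)         # number of key-value pairs
entry = follow(vec + 4)  # table of the first (and only) pair

# Signed offset at the table start, subtracted to reach the vtable.
vtable = entry - struct.unpack_from("<i", dump, entry - BASE)[0]
key_off = u16(vtable + 4)
value_off = u16(vtable + 6)

key_len, key = read_string(follow(entry + key_off))
value_len, value_prefix = read_string(follow(entry + value_off))

print(count, key, value_len, value_prefix)
```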


I haven't traced the entire data file yet, but the layout is much clearer to me now.

Best regards,

On Sun, Jan 6, 2019 at 8:50, Wes McKinney <[email protected]> wrote:
>
> hi Kohei,
>
> On Thu, Jan 3, 2019 at 7:14 PM Kohei KaiGai <[email protected]> wrote:
> >
> > Hello,
> >
> > I'm now trying to understand the Apache Arrow format for my application.
> > Is there a format specification document including meta-data layout?
> >
> > I checked out the description at:
> > https://github.com/apache/arrow/tree/master/docs/source/format
> > https://github.com/apache/arrow/tree/master/format
> >
> > The format/IPC.rst says an arrow file has the format below:
> >
> > <magic number "ARROW1">
> > <empty padding bytes [to 8 byte boundary]>
> > <STREAMING FORMAT>
> > <FOOTER>
> > <FOOTER SIZE: int32>
> > <magic number "ARROW1">
> >
> > Then, STREAMING FORMAT begins from SCHEMA-message.
> > The message chunk has the format below:
> >
> > <metadata_size: int32>
> > <metadata_flatbuffer: bytes>
> > <padding>
> > <message body>
> >
> > I made an arrow file using pyarrow [*1]. It has the following binary.
> >
> > [kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
> > 000000  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
> > 000010  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
> > 000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
> > 000030  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
> > 000040  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
> > 000050  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
> > 000060  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
> > 000070  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
> > 000080  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
> > 000090  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
> > 0000a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
> > 0000b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
> > 0000c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
> > 0000d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
> > 0000e0  20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
> > 0000f0  22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61
> >
> > The first 64 bits are "ARROW1\0\0", and the next 32 bits are 0x058c (=1420),
> > which is reasonable for the SCHEMA-message length.
> > The next 32 bits are 0x0010 (=16). It may be the metadata_size of the FlatBuffer.
> > IPC.rst does not mention FlatBuffer metadata, so I tried to skip the
> > next 16 bytes, expecting the message body to begin at 0x000020.
>
> The Schema message has no message body -- it is all in the metadata
> (i.e. as a Flatbuffer). Take a look at the C++ implementation
>
> * File preamble plus padding:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L946
> * Write schema
>    * from here 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L790
>    * to here 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L939
>
> The flatbuffer size 1420 is written here
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L954
>
> followed by the Schema Flatbuffer message
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L957
>
> followed by any padding bytes
>
> Thus, the file layout should look like this
>
> 8 bytes: preamble
> BEGIN MESSAGE 1 (Schema)
> 4 bytes: metadata size
> 1420 bytes: metadata (as described in Message.fbs) plus padding
> 0 bytes: body (Schema has no body)
>
> I would be happy to clarify the specification document to make this
> more clear if you can suggest some improvements.
>
> - Wes
>
> > However, the first 16 bits (version) are 0x0001 (=V2), the next byte is 0x03
> > (= RecordBatch, not Schema!), and the following 64 bits are 0x0a000000000010(!).
> > Obviously, I am misunderstanding something.
> >
> > Is there any documentation that describes the detailed layout of the
> > arrow format?
> >
> > Thanks,
> >
> > [*1] Steps to make a sample arrow file
> > $ python3.5
> > >>> import pyarrow as pa
> > >>> import pandas as pd
> > >>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
> > ...                 con="postgresql://localhost/postgres")
> > >>> Y = pa.Table.from_pandas(X)
> > >>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
> > >>> f.write_table(Y)
> > >>> f.close()
> >
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei <[email protected]>



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <[email protected]>
