Hello,

I'm trying to understand the Apache Arrow file format for my application.
Is there a format specification document that includes the metadata layout?

I checked out the description at:
https://github.com/apache/arrow/tree/master/docs/source/format
https://github.com/apache/arrow/tree/master/format

The format/IPC.rst says an arrow file has the format below:

<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<STREAMING FORMAT>
<FOOTER>
<FOOTER SIZE: int32>
<magic number "ARROW1">
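
To make my reading of this layout concrete, here is a minimal Python sketch
that checks both magic numbers and reads the footer size. The helper name
read_footer_size and the synthetic sample bytes are made up for illustration,
and little-endian byte order is my assumption.

```python
import struct

MAGIC = b"ARROW1"

def read_footer_size(data: bytes) -> int:
    """Check both magic numbers and return the int32 footer size."""
    if data[:6] != MAGIC or data[-6:] != MAGIC:
        raise ValueError("not an Arrow file: magic number missing")
    # <FOOTER SIZE: int32> sits immediately before the trailing magic.
    # Little-endian ("<i") is an assumption of this sketch.
    (footer_size,) = struct.unpack_from("<i", data, len(data) - 6 - 4)
    return footer_size

# Synthetic stand-in for a file, not real Arrow data:
# magic, 2 padding bytes, fake stream, 5-byte fake footer, footer size, magic.
sample = MAGIC + b"\x00\x00" + b"S" * 8 + b"F" * 5 + struct.pack("<i", 5) + MAGIC
print(read_footer_size(sample))
```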

The STREAMING FORMAT part then begins with a SCHEMA message.
Each message chunk has the format below:

<metadata_size: int32>
<metadata_flatbuffer: bytes>
<padding>
<message body>
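
As I understand it, reading one message chunk would look roughly like the
sketch below. The function name split_message is hypothetical, little-endian
byte order is assumed, and the sketch stops after the metadata bytes because
the padding and body length are not something I can derive from this layout
alone.

```python
import struct

def split_message(buf: bytes, offset: int = 0):
    """Return (metadata_size, metadata_bytes, offset_after_metadata)."""
    # <metadata_size: int32>, assumed little-endian.
    (metadata_size,) = struct.unpack_from("<i", buf, offset)
    start = offset + 4
    # <metadata_flatbuffer: bytes>
    metadata = buf[start:start + metadata_size]
    # <padding> and <message body> follow; they are not parsed here.
    return metadata_size, metadata, start + metadata_size

# Synthetic chunk, not real Arrow data: 4 bytes of fake metadata, then a fake body.
chunk = struct.pack("<i", 4) + b"META" + b"BODY"
size, metadata, end = split_message(chunk)
print(size, metadata, chunk[end:])
```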

I made an Arrow file using pyarrow [*1]. Its first bytes are as follows.

[kaigai@saba ~]$ cat /tmp/sample.arrow | od -Ax -t x1 | head -16
000000  41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00
000010  00 00 0a 00 0e 00 06 00 05 00 08 00 0a 00 00 00
000020  00 01 03 00 10 00 00 00 00 00 0a 00 0c 00 00 00
000030  04 00 08 00 0a 00 00 00 ec 03 00 00 04 00 00 00
000040  01 00 00 00 0c 00 00 00 08 00 0c 00 04 00 08 00
000050  08 00 00 00 08 00 00 00 10 00 00 00 06 00 00 00
000060  70 61 6e 64 61 73 00 00 b4 03 00 00 7b 22 70 61
000070  6e 64 61 73 5f 76 65 72 73 69 6f 6e 22 3a 20 22
000080  30 2e 32 32 2e 30 22 2c 20 22 63 6f 6c 75 6d 6e
000090  73 22 3a 20 5b 7b 22 6d 65 74 61 64 61 74 61 22
0000a0  3a 20 6e 75 6c 6c 2c 20 22 6e 75 6d 70 79 5f 74
0000b0  79 70 65 22 3a 20 22 69 6e 74 36 34 22 2c 20 22
0000c0  6e 61 6d 65 22 3a 20 22 69 64 22 2c 20 22 66 69
0000d0  65 6c 64 5f 6e 61 6d 65 22 3a 20 22 69 64 22 2c
0000e0  20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20
0000f0  22 69 6e 74 36 34 22 7d 2c 20 7b 22 6d 65 74 61

The first 64 bits are "ARROW1\0\0", and the next 32 bits are 0x058c (=1420),
which is a reasonable length for a SCHEMA message.
The next 32 bits are 0x0010 (=16), which may be the metadata_size of the FlatBuffer.
IPC.rst does not describe the FlatBuffer metadata, so I tried skipping the
next 16 bytes, expecting the message body to begin at 0x000020.
However, the first 16 bits there (version) are 0x0001 (=V2), the next byte is 0x03
(= RecordBatch, not Schema!), and the following 64 bits are 0x0a000000000010(!).
Obviously, I am misunderstanding something.
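
For reference, the first 16 bytes of the dump can be decoded with a few lines
of Python. This only reproduces the raw values; the variable names are mine,
little-endian byte order is assumed, and what each field actually means is
exactly what I am unsure about.

```python
import struct

# First row of the od output above (offset 0x000000).
header = bytes.fromhex("41 52 52 4f 57 31 00 00 8c 05 00 00 10 00 00 00")

magic = header[:6]       # leading magic number
padding = header[6:8]    # padding up to the 8-byte boundary
# Two consecutive 32-bit integers at offsets 8 and 12, assumed little-endian.
first_int32, second_int32 = struct.unpack_from("<ii", header, 8)

print(magic, padding, first_int32, second_int32)
```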

Is there any documentation that describes the detailed layout of the Arrow format?

Thanks,

[*1] Steps to make a sample arrow file
$ python3.5
>>> import pyarrow as pa
>>> import pandas as pd
>>> X = pd.read_sql(sql="SELECT * FROM hogehoge LIMIT 1000",
...                 con="postgresql://localhost/postgres")
>>> Y = pa.Table.from_pandas(X)
>>> f = pa.RecordBatchFileWriter('/tmp/sample.arrow', Y.schema)
>>> f.write_table(Y)
>>> f.close()

-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kai...@heterodb.com>
