Repository: arrow Updated Branches: refs/heads/master acbda1893 -> ddda3039e
ARROW-526: [Format] Revise Format documents for evolution in IPC stream / file / tensor formats Author: Wes McKinney <wes.mckin...@twosigma.com> Closes #515 from wesm/ARROW-526 and squashes the following commits: 6a38432 [Wes McKinney] Typo 5d564a6 [Wes McKinney] Revise Format documents for evolution in IPC stream / file / tensor formats Project: http://git-wip-us.apache.org/repos/asf/arrow/repo Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/ddda3039 Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/ddda3039 Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/ddda3039 Branch: refs/heads/master Commit: ddda3039e6fb6a9d4f2c5b1189369204bfe1ea93 Parents: acbda18 Author: Wes McKinney <wes.mckin...@twosigma.com> Authored: Mon Apr 10 08:30:27 2017 -0400 Committer: Wes McKinney <wes.mckin...@twosigma.com> Committed: Mon Apr 10 08:30:27 2017 -0400 ---------------------------------------------------------------------- format/IPC.md | 131 +++++++++++++++++++++++++++++++++++------------- format/Metadata.md | 57 ++++++++++++++++++--- 2 files changed, 146 insertions(+), 42 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/arrow/blob/ddda3039/format/IPC.md ---------------------------------------------------------------------- diff --git a/format/IPC.md b/format/IPC.md index d386e60..f0a67e2 100644 --- a/format/IPC.md +++ b/format/IPC.md @@ -14,65 +14,106 @@ # Interprocess messaging / communication (IPC) -## File format +## Encapsulated message format + +Data components in the stream and file formats are represented as encapsulated +*messages* consisting of: -We define a self-contained "file format" containing an Arrow schema along with -one or more record batches defining a dataset. See [format/File.fbs][1] for the -precise details of the file metadata. +* A length prefix indicating the metadata size +* The message metadata as a [Flatbuffer][3] +* Padding bytes to an 8-byte boundary +* The message body -In general, the file looks like: +Schematically, we have: ``` -<magic number "ARROW1"> -<empty padding bytes [to 8 byte boundary]> +<metadata_size: int32> +<metadata_flatbuffer: bytes> +<padding> +<message body> +``` + +The `metadata_size` includes the size of the flatbuffer plus padding. The +`Message` flatbuffer includes a version number, the particular message (as a +flatbuffer union), and the size of the message body: + +``` +table Message { + version: org.apache.arrow.flatbuf.MetadataVersion; + header: MessageHeader; + bodyLength: long; +} +``` + +Currently, we support 4 types of messages: + +* Schema +* RecordBatch +* DictionaryBatch +* Tensor + +## Streaming format + +We provide a streaming format for record batches. It is presented as a sequence +of encapsulated messages, each of which follows the format above. The schema +comes first in the stream, and it is the same for all of the record batches +that follow. If any fields in the schema are dictionary-encoded, one or more +`DictionaryBatch` messages will follow the schema. + +``` +<SCHEMA> <DICTIONARY 0> ... <DICTIONARY k - 1> <RECORD BATCH 0> ... <RECORD BATCH n - 1> -<METADATA org.apache.arrow.flatbuf.Footer> -<metadata_size: int32> -<magic number "ARROW1"> +<EOS [optional]: int32> ``` -See the File.fbs document for details about the Flatbuffers metadata. The -record batches have a particular structure, defined next. +When a stream reader implementation is reading a stream, after each message, it +may read the next 4 bytes to know how large the message metadata that follows +is. Once the message flatbuffer is read, you can then read the message body. + +The stream writer can signal end-of-stream (EOS) either by writing a 0 length +as an `int32` or simply closing the stream interface. + +## File format -### Record batches +We define a "file format" supporting random access in a very similar format to +the streaming format. The file starts and ends with a magic string `ARROW1` +(plus padding). What follows in the file is identical to the stream format. At +the end of the file, we write a *footer* including offsets and sizes for each +of the data blocks in the file, so that random access is possible. See +[format/File.fbs][1] for the precise details of the file footer. -The record batch metadata is written as a flatbuffer (see -[format/Message.fbs][2] -- the RecordBatch message type) prefixed by its size, -followed by each of the memory buffers in the batch written end to end (with -appropriate alignment and padding): +Schematically we have: ``` -<int32: metadata flatbuffer size> -<metadata: org.apache.arrow.flatbuf.RecordBatch> -<padding bytes [to 8-byte boundary]> -<body: buffers end to end> +<magic number "ARROW1"> +<empty padding bytes [to 8 byte boundary]> +<STREAMING FORMAT> +<FOOTER> +<FOOTER SIZE: int32> +<magic number "ARROW1"> ``` +### RecordBatch body structure + The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of field metadata and physical memory buffers (some comments from [Message.fbs][2] have been shortened / removed): ``` table RecordBatch { - length: int; + length: long; nodes: [FieldNode]; buffers: [Buffer]; } struct FieldNode { - /// The number of value slots in the Arrow array at this level of a nested - /// tree - length: int; - - /// The number of observed nulls. Fields with null_count == 0 may choose not - /// to write their physical validity bitmap out as a materialized buffer, - /// instead setting the length of the bitmap buffer to 0. - null_count: int; + length: long; + null_count: long; } struct Buffer { @@ -91,9 +132,9 @@ struct Buffer { ``` In the context of a file, the `page` is not used, and the `Buffer` offsets use -as a frame of reference the start of the segment where they are written in the -file. So, while in a general IPC setting these offsets may be anyplace in one -or more shared memory regions, in the file format the offsets start from 0. +as a frame of reference the start of the message body. So, while in a general +IPC setting these offsets may be anyplace in one or more shared memory regions, +in the file format the offsets start from 0. The location of a record batch and the size of the metadata block as well as the body of buffers is stored in the file footer: @@ -112,12 +153,30 @@ Some notes about this * The metadata length includes the flatbuffer size, the record batch metadata flatbuffer, and any padding bytes - -### Dictionary batches +### Dictionary Batches Dictionary batches have not yet been implemented, while they are provided for in the metadata. For the time being, the `DICTIONARY` segments shown above in the file do not appear in any of the file implementations. +### Tensor (Multi-dimensional Array) Message Format + +The `Tensor` message types provides a way to write a multidimensional array of +fixed-size values (such as a NumPy ndarray) using Arrow's shared memory +tools. Arrow implementations in general are not required to implement this data +format, though we provide a reference implementation in C++. + +When writing a standalone encapsulated tensor message, we use the format as +indicated above, but additionally align the starting offset (if writing to a +shared memory region) to be a multiple of 8: + +``` +<PADDING> +<metadata size: int32> +<metadata> +<tensor body> +``` + [1]: https://github.com/apache/arrow/blob/master/format/File.fbs -[1]: https://github.com/apache/arrow/blob/master/format/Message.fbs \ No newline at end of file +[2]: https://github.com/apache/arrow/blob/master/format/Message.fbs +[3]: https://github.com/google]/flatbuffers http://git-wip-us.apache.org/repos/asf/arrow/blob/ddda3039/format/Metadata.md ---------------------------------------------------------------------- diff --git a/format/Metadata.md b/format/Metadata.md index a4878f3..18fac52 100644 --- a/format/Metadata.md +++ b/format/Metadata.md @@ -86,8 +86,8 @@ VectorLayout: Type: ``` { - "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|bool|decimal|date|time|timestamp|interval" - // fields as defined in the flatbuff depending on the type name + "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval" + // fields as defined in the Flatbuffer depending on the type name } ``` Union: @@ -126,14 +126,37 @@ Decimal: "scale" : /* integer */ } ``` + Timestamp: + ``` { "name" : "timestamp", "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND" } ``` + +Date: + +``` +{ + "name" : "date", + "unit" : "DAY|MILLISECOND" +} +``` + +Time: + +``` +{ + "name" : "time", + "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND", + "bitWidth": /* integer: 32 or 64 */ +} +``` + Interval: + ``` { "name" : "interval", @@ -161,12 +184,16 @@ Flatbuffers IDL for a record batch data header ``` table RecordBatch { - length: int; + length: long; nodes: [FieldNode]; buffers: [Buffer]; } ``` +The `RecordBatch` metadata provides for record batches with length exceeding +2^31 - 1, but Arrow implementations are not required to implement support +beyond this size. + The `nodes` and `buffers` fields are produced by a depth-first traversal / flattening of a schema (possibly containing nested types) for a given in-memory data set. @@ -205,13 +232,17 @@ hierarchy. struct FieldNode { /// The number of value slots in the Arrow array at this level of a nested /// tree - length: int; + length: long; /// The number of observed nulls. - null_count: int; + null_count: lohng; } ``` +The `FieldNode` metadata provides for fields with length exceeding 2^31 - 1, +but Arrow implementations are not required to implement support for large +arrays. + ## Flattening of nested data Nested types are flattened in the record batch in depth-first order. When @@ -359,7 +390,21 @@ TBD ### Timestamp -TBD +All timestamps are stored as a 64-bit integer, with one of four unit +resolutions: second, millisecond, microsecond, and nanosecond. + +### Date + +We support two different date types: + +* Days since the UNIX epoch as a 32-bit integer +* Milliseconds since the UNIX epoch as a 64-bit integer + +### Time + +Time supports the same unit resolutions: second, millisecond, microsecond, and +nanosecond. We represent time as the smallest integer accommodating the +indicated unit. For second and millisecond: 32-bit, for the others 64-bit. ## Dictionary encoding