Repository: arrow
Updated Branches:
  refs/heads/master 8f2b44b89 -> a44155d6e

ARROW-986: [Format] Add brief explanation of dictionary batches in IPC.md

Author: Wes McKinney <[email protected]>

Closes #732 from wesm/ARROW-986 and squashes the following commits:

4321106 [Wes McKinney] Add brief explanation of dictionary batches in IPC.md

Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/a44155d6
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/a44155d6
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/a44155d6

Branch: refs/heads/master
Commit: a44155d6ec5d0c6c255d3305a494f51a6b1d2316
Parents: 8f2b44b
Author: Wes McKinney <[email protected]>
Authored: Mon Jun 5 12:20:35 2017 +0200
Committer: Uwe L. Korn <[email protected]>
Committed: Mon Jun 5 12:20:35 2017 +0200

----------------------------------------------------------------------
 format/IPC.md | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/arrow/blob/a44155d6/format/IPC.md
----------------------------------------------------------------------
diff --git a/format/IPC.md b/format/IPC.md
index bf2aaa7..7d68921 100644
--- a/format/IPC.md
+++ b/format/IPC.md
@@ -157,9 +157,24 @@ Some notes about this
 
 ### Dictionary Batches
 
-Dictionary batches have not yet been implemented, while they are provided for
-in the metadata. For the time being, the `DICTIONARY` segments shown above in
-the file do not appear in any of the file implementations.
+Dictionaries are written in the stream and file formats as a sequence of record
+batches, each having a single field. The complete semantic schema for a
+sequence of record batches, therefore, consists of the schema along with all of
+the dictionaries. The dictionary types are found in the schema, so it is
+necessary to read the schema to first determine the dictionary types so that
+the dictionaries can be properly interpreted.
+
+```
+table DictionaryBatch {
+  id: long;
+  data: RecordBatch;
+}
+```
+
+The dictionary `id` in the message metadata can be referenced one or more times
+in the schema, so that dictionaries can even be used for multiple fields. See
+the [Physical Layout][4] document for more about the semantics of
+dictionary-encoded data.
 
 ### Tensor (Multi-dimensional Array) Message Format
 
@@ -182,3 +197,4 @@ shared memory region) to be a multiple of 8:
 [1]: https://github.com/apache/arrow/blob/master/format/File.fbs
 [2]: https://github.com/apache/arrow/blob/master/format/Message.fbs
 [3]: https://github.com/google]/flatbuffers
+[4]: https://github.com/apache/arrow/blob/master/format/Layout.md
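
Editorial note: the added paragraphs say that a dictionary `id` may be referenced by more than one field in the schema, and that the schema has to be read first so the dictionaries can be interpreted. The sketch below illustrates that lookup in plain Python. It is not the Arrow implementation and does not use any Arrow library: `Field`, `DictionaryBatch`, `decode`, and the sample values are hypothetical stand-ins, and a plain list stands in for the single-field record batch carried in `data`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Field:
    """Hypothetical schema entry; dictionary_id is None for plain fields."""
    name: str
    dictionary_id: Optional[int] = None


@dataclass
class DictionaryBatch:
    """Mirrors the two members of the Flatbuffers table: id and data."""
    id: int
    data: List[str]  # stand-in for a single-field RecordBatch


def decode(schema: List[Field],
           dictionaries: List[DictionaryBatch],
           columns: Dict[str, List[int]]) -> Dict[str, list]:
    """Resolve dictionary-encoded columns by looking up each field's id."""
    by_id = {d.id: d.data for d in dictionaries}
    decoded = {}
    for field in schema:  # the schema is consulted first, per the text above
        values = columns[field.name]
        if field.dictionary_id is None:
            decoded[field.name] = list(values)
        else:
            dictionary = by_id[field.dictionary_id]
            decoded[field.name] = [dictionary[i] for i in values]
    return decoded


# Two fields reference the same dictionary id (0); one field is not encoded.
schema = [Field("origin", 0), Field("destination", 0), Field("distance")]
dictionaries = [DictionaryBatch(0, ["AMS", "JFK", "SFO"])]
columns = {"origin": [0, 1], "destination": [2, 0], "distance": [5800, 5900]}

decoded = decode(schema, dictionaries, columns)
```

Keying the dictionaries by `id` rather than by field name is what lets `origin` and `destination` share one dictionary here.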
