Repository: arrow
Updated Branches:
  refs/heads/master 282103012 -> 085c8754b

ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases

cc @julienledem @nongli @jacques-n. I am hoping to close the loop on our discussion in https://issues.apache.org/jira/browse/ARROW-81.

In my applications, I need the flexibility to transmit:

* Dictionaries encoded in signed integers smaller than int32. For example, with 10 dictionary values, we may send int8 indices
* Indicator that the dictionary is ordered

These features are needed for Python and R support, and in general for statistical computing applications.

Author: Wes McKinney <wes.mckin...@twosigma.com>

Closes #297 from wesm/ARROW-81 and squashes the following commits:

c960bac [Wes McKinney] Augment dictionary encoding metadata to accommodate additional use cases

Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/085c8754
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/085c8754
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/085c8754

Branch: refs/heads/master
Commit: 085c8754b0ab2da7fcd245fc88bc4de9a6806a4c
Parents: 2821030
Author: Wes McKinney <wes.mckin...@twosigma.com>
Authored: Mon Jan 23 09:13:39 2017 -0500
Committer: Wes McKinney <wes.mckin...@twosigma.com>
Committed: Mon Jan 23 09:13:39 2017 -0500

----------------------------------------------------------------------
 format/Message.fbs | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/arrow/blob/085c8754/format/Message.fbs
----------------------------------------------------------------------
diff --git a/format/Message.fbs b/format/Message.fbs
index b2c6464..028c56a 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -151,6 +151,26 @@ table KeyValue {
 }
 
 /// ----------------------------------------------------------------------
+/// Dictionary encoding metadata
+
+table DictionaryEncoding {
+  /// The known dictionary id in the application where this data is used. In
+  /// the file or streaming formats, the dictionary ids are found in the
+  /// DictionaryBatch messages
+  id: long;
+
+  /// The dictionary indices are constrained to be positive integers. If this
+  /// field is null, the indices must be signed int32
+  indexType: Int;
+
+  /// By default, dictionaries are not ordered, or the order does not have
+  /// semantic meaning. In some statistical, applications, dictionary-encoding
+  /// is used to represent ordered categorical data, and we provide a way to
+  /// preserve that metadata here
+  isOrdered: bool;
+}
+
+/// ----------------------------------------------------------------------
 /// A field represents a named column in a record / row batch or child of a
 /// nested type.
 ///
@@ -163,9 +183,10 @@ table Field {
   name: string;
   nullable: bool;
   type: Type;
-  // present only if the field is dictionary encoded
-  // will point to a dictionary provided by a DictionaryBatch message
-  dictionary: long;
+
+  // Present only if the field is dictionary encoded
+  dictionary: DictionaryEncoding;
+
   // children apply only to Nested data types like Struct, List and Union
   children: [Field];
   /// layout of buffers produced for this type (as derived from the Type)
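
For readers who want to see how the two new fields might be used, here is a minimal Python sketch. It is not part of the commit and does not use any Arrow library. The `DictionaryEncoding` dataclass, `smallest_signed_bit_width`, and `dictionary_encode` are hypothetical names chosen for illustration, and the sketch models `indexType` as a bare bit width rather than the full `Int` table. It picks the narrowest signed integer width that can index a dictionary (int8 for 10 values, as in the commit message) and records whether the category order is meaningful.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class DictionaryEncoding:
    """Hypothetical Python mirror of the DictionaryEncoding table in the diff."""
    id: int
    # Assumption: only the bit width is modeled; None stands for the
    # "field is null" case, which the schema defines as signed int32.
    index_bit_width: Optional[int] = None
    is_ordered: bool = False


def smallest_signed_bit_width(num_values: int) -> int:
    """Return the narrowest signed width (8, 16, 32 or 64) able to index num_values entries."""
    for bits in (8, 16, 32, 64):
        # A signed integer with `bits` bits holds indices 0 .. 2**(bits - 1) - 1.
        if num_values - 1 <= 2 ** (bits - 1) - 1:
            return bits
    raise ValueError("dictionary too large for a 64-bit signed index")


def dictionary_encode(
    values: Sequence[str],
    categories: Sequence[str],
    dictionary_id: int,
    ordered: bool = False,
) -> Tuple[List[int], DictionaryEncoding]:
    """Map each value to its position in `categories` and build the metadata."""
    position = {category: i for i, category in enumerate(categories)}
    indices = [position[value] for value in values]
    metadata = DictionaryEncoding(
        id=dictionary_id,
        index_bit_width=smallest_signed_bit_width(len(categories)),
        is_ordered=ordered,
    )
    return indices, metadata


if __name__ == "__main__":
    # An ordered categorical: the position in `categories` carries meaning.
    indices, metadata = dictionary_encode(
        ["low", "high", "low"], ["low", "medium", "high"], dictionary_id=0, ordered=True
    )
    print(indices, metadata)
```

The categories are passed in explicitly rather than derived by sorting the values, because for ordered categorical data the order is supplied by the application and is generally not alphabetical.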