[ 
https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238657#comment-16238657
 ] 

ASF GitHub Bot commented on ARROW-1727:
---------------------------------------

wesm closed pull request #1257: ARROW-1727: [Format] Expand Arrow streaming 
format to permit deltas / additions to existing dictionaries
URL: https://github.com/apache/arrow/pull/1257
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/format/IPC.md b/format/IPC.md
index 2f7903144..f3b48854c 100644
--- a/format/IPC.md
+++ b/format/IPC.md
@@ -67,7 +67,9 @@ We provide a streaming format for record batches. It is presented as a sequence
 of encapsulated messages, each of which follows the format above. The schema
 comes first in the stream, and it is the same for all of the record batches
 that follow. If any fields in the schema are dictionary-encoded, one or more
-`DictionaryBatch` messages will follow the schema.
+`DictionaryBatch` messages will be included. `DictionaryBatch` and
+`RecordBatch` messages may be interleaved, but before any dictionary key is used
+in a `RecordBatch` it should be defined in a `DictionaryBatch`.
 
 ```
 <SCHEMA>
@@ -76,6 +78,10 @@ that follow. If any fields in the schema are dictionary-encoded, one or more
 <DICTIONARY k - 1>
 <RECORD BATCH 0>
 ...
+<DICTIONARY x DELTA>
+...
+<DICTIONARY y DELTA>
+...
 <RECORD BATCH n - 1>
 <EOS [optional]: int32>
 ```
@@ -109,6 +115,10 @@ Schematically we have:
 <magic number "ARROW1">
 ```
 
+In the file format, there is no requirement that dictionary keys should be
+defined in a `DictionaryBatch` before they are used in a `RecordBatch`, as long
+as the keys are defined somewhere in the file.
+
 ### RecordBatch body structure
 
 The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of
@@ -181,6 +191,7 @@ the dictionaries can be properly interpreted.
 table DictionaryBatch {
   id: long;
   data: RecordBatch;
+  isDelta: boolean = false;
 }
 ```
 
@@ -189,6 +200,38 @@ in the schema, so that dictionaries can even be used for multiple fields. See
 the [Physical Layout][4] document for more about the semantics of
 dictionary-encoded data.
 
+The dictionary `isDelta` flag allows dictionary batches to be modified
+mid-stream.  A dictionary batch with `isDelta` set indicates that its vector
+should be concatenated with those of any previous batches with the same `id`. A
+stream which encodes one column, the list of strings
+`["A", "B", "C", "B", "D", "C", "E", "A"]`, with a delta dictionary batch could
+take the form:
+
+```
+<SCHEMA>
+<DICTIONARY 0>
+(0) "A"
+(1) "B"
+(2) "C"
+
+<RECORD BATCH 0>
+0
+1
+2
+1
+
+<DICTIONARY 0 DELTA>
+(3) "D"
+(4) "E"
+
+<RECORD BATCH 1>
+3
+2
+4
+0
+EOS
+```
+
 ### Tensor (Multi-dimensional Array) Message Format
 
 The `Tensor` message types provides a way to write a multidimensional array of
diff --git a/format/Layout.md b/format/Layout.md
index ebf93821a..963202f9f 100644
--- a/format/Layout.md
+++ b/format/Layout.md
@@ -615,9 +615,9 @@ the the types array indicates that a slot contains a different type at the index
 ## Dictionary encoding
 
 When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary.
-The Dictionary is received as a DictionaryBatch whose id is referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
-The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch.
-When a Schema references a Dictionary id, it must send a DictionaryBatch for this id before any RecordBatch.
+The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
+The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches.
+When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.
 
 As an example, you could have the following data:
 ```
diff --git a/format/Message.fbs b/format/Message.fbs
index f4a95713c..830718139 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -61,16 +61,20 @@ table RecordBatch {
   buffers: [Buffer];
 }
 
-/// ----------------------------------------------------------------------
 /// For sending dictionary encoding information. Any Field can be
 /// dictionary-encoded, but in this case none of its children may be
 /// dictionary-encoded.
-/// There is one vector / column per dictionary
-///
+/// There is one vector / column per dictionary, but that vector / column
+/// may be spread across multiple dictionary batches by using the isDelta
+/// flag
 
 table DictionaryBatch {
   id: long;
   data: RecordBatch;
+
+  /// If isDelta is true the values in the dictionary are to be appended to a
+  /// dictionary with the indicated id
+  isDelta: bool = false;
 }
 
 /// ----------------------------------------------------------------------
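
For readers of this diff, the delta semantics added to format/IPC.md can be
sketched in plain Python. This is only an illustration, not part of the pull
request and not any Arrow library's API: the tuple message shapes, the
`decode_stream` function, and `EXAMPLE_STREAM` are hypothetical names, and the
handling of a non-delta batch as a fresh definition of the dictionary is an
assumption that the diff does not spell out.

```python
from typing import Dict, List, Tuple

# Hypothetical message shapes for illustration only; not the Arrow wire format.
# ("dictionary", id, values, is_delta) or ("record", dictionary_id, indices)
Message = Tuple


def decode_stream(messages: List[Message]) -> List[str]:
    """Decode a single dictionary-encoded string column from a message list."""
    dictionaries: Dict[int, List[str]] = {}
    decoded: List[str] = []
    for message in messages:
        kind = message[0]
        if kind == "dictionary":
            _, dict_id, values, is_delta = message
            if is_delta:
                # isDelta set: concatenate with earlier values for the same id.
                dictionaries.setdefault(dict_id, []).extend(values)
            else:
                # Assumption: a non-delta batch (re)defines the dictionary.
                dictionaries[dict_id] = list(values)
        elif kind == "record":
            _, dict_id, indices = message
            dictionary = dictionaries.get(dict_id, [])
            for index in indices:
                # Streaming rule: a key must be defined before it is used.
                if not 0 <= index < len(dictionary):
                    raise IndexError(
                        f"key {index} not yet defined for dictionary {dict_id}"
                    )
                decoded.append(dictionary[index])
        else:
            raise ValueError(f"unknown message kind: {kind!r}")
    return decoded


# The example stream from the IPC.md hunk above.
EXAMPLE_STREAM = [
    ("dictionary", 0, ["A", "B", "C"], False),
    ("record", 0, [0, 1, 2, 1]),
    ("dictionary", 0, ["D", "E"], True),
    ("record", 0, [3, 2, 4, 0]),
]

print(decode_stream(EXAMPLE_STREAM))
# ['A', 'B', 'C', 'B', 'D', 'C', 'E', 'A']
```

The sketch raises an error when a record batch refers to a key that no earlier
dictionary batch has defined, which mirrors the ordering rule for the streaming
format; the file format relaxes that rule, so a file reader would need to
collect all dictionary batches before decoding.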


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Format] Expand Arrow streaming format to permit new dictionaries and deltas 
> / additions to existing dictionaries
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1727
>                 URL: https://issues.apache.org/jira/browse/ARROW-1727
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>            Reporter: Wes McKinney
>            Assignee: Brian Hulette
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
