[
https://issues.apache.org/jira/browse/ARROW-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238657#comment-16238657
]
ASF GitHub Bot commented on ARROW-1727:
---------------------------------------
wesm closed pull request #1257: ARROW-1727: [Format] Expand Arrow streaming format to permit deltas / additions to existing dictionaries
URL: https://github.com/apache/arrow/pull/1257
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/format/IPC.md b/format/IPC.md
index 2f7903144..f3b48854c 100644
--- a/format/IPC.md
+++ b/format/IPC.md
@@ -67,7 +67,9 @@ We provide a streaming format for record batches. It is presented as a sequence
of encapsulated messages, each of which follows the format above. The schema
comes first in the stream, and it is the same for all of the record batches
that follow. If any fields in the schema are dictionary-encoded, one or more
-`DictionaryBatch` messages will follow the schema.
+`DictionaryBatch` messages will be included. `DictionaryBatch` and
+`RecordBatch` messages may be interleaved, but before any dictionary key is used
+in a `RecordBatch` it should be defined in a `DictionaryBatch`.
```
<SCHEMA>
@@ -76,6 +78,10 @@ that follow. If any fields in the schema are dictionary-encoded, one or more
<DICTIONARY k - 1>
<RECORD BATCH 0>
...
+<DICTIONARY x DELTA>
+...
+<DICTIONARY y DELTA>
+...
<RECORD BATCH n - 1>
<EOS [optional]: int32>
```
@@ -109,6 +115,10 @@ Schematically we have:
<magic number "ARROW1">
```
+In the file format, there is no requirement that dictionary keys should be
+defined in a `DictionaryBatch` before they are used in a `RecordBatch`, as long
+as the keys are defined somewhere in the file.
+
### RecordBatch body structure
The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of
@@ -181,6 +191,7 @@ the dictionaries can be properly interpreted.
table DictionaryBatch {
id: long;
data: RecordBatch;
+ isDelta: boolean = false;
}
```
@@ -189,6 +200,38 @@ in the schema, so that dictionaries can even be used for multiple fields. See
the [Physical Layout][4] document for more about the semantics of
dictionary-encoded data.
+The dictionary `isDelta` flag allows dictionary batches to be modified
+mid-stream. A dictionary batch with `isDelta` set indicates that its vector
+should be concatenated with those of any previous batches with the same `id`. A
+stream which encodes one column, the list of strings
+`["A", "B", "C", "B", "D", "C", "E", "A"]`, with a delta dictionary batch could
+take the form:
+
+```
+<SCHEMA>
+<DICTIONARY 0>
+(0) "A"
+(1) "B"
+(2) "C"
+
+<RECORD BATCH 0>
+0
+1
+2
+1
+
+<DICTIONARY 0 DELTA>
+(3) "D"
+(4) "E"
+
+<RECORD BATCH 1>
+3
+2
+4
+0
+EOS
+```
+
### Tensor (Multi-dimensional Array) Message Format
The `Tensor` message types provides a way to write a multidimensional array of
diff --git a/format/Layout.md b/format/Layout.md
index ebf93821a..963202f9f 100644
--- a/format/Layout.md
+++ b/format/Layout.md
@@ -615,9 +615,9 @@ the the types array indicates that a slot contains a different type at the index
## Dictionary encoding
When a field is dictionary encoded, the values are represented by an array of
Int32 representing the index of the value in the dictionary.
-The Dictionary is received as a DictionaryBatch whose id is referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
-The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch.
-When a Schema references a Dictionary id, it must send a DictionaryBatch for this id before any RecordBatch.
+The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata ([Message.fbs][7]) in the Field table.
+The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches.
+When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.
As an example, you could have the following data:
```
diff --git a/format/Message.fbs b/format/Message.fbs
index f4a95713c..830718139 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -61,16 +61,20 @@ table RecordBatch {
buffers: [Buffer];
}
-/// ----------------------------------------------------------------------
/// For sending dictionary encoding information. Any Field can be
/// dictionary-encoded, but in this case none of its children may be
/// dictionary-encoded.
-/// There is one vector / column per dictionary
-///
+/// There is one vector / column per dictionary, but that vector / column
+/// may be spread across multiple dictionary batches by using the isDelta
+/// flag
table DictionaryBatch {
id: long;
data: RecordBatch;
+
+ /// If isDelta is true the values in the dictionary are to be appended to a
+ /// dictionary with the indicated id
+ isDelta: bool = false;
}
/// ----------------------------------------------------------------------
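
For illustration only (not part of the pull request): a minimal Python sketch of how a stream reader might apply the `isDelta` flag described in the diff above, using the same `["A", "B", "C", "B", "D", "C", "E", "A"]` example. The tuple-based message representation and the `decode_stream` helper are hypothetical and are not Arrow APIs. The sketch also assumes that a non-delta `DictionaryBatch` replaces any earlier dictionary with the same id, and that a delta for an unseen id starts from an empty dictionary; the diff does not specify either case.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Hypothetical in-memory stand-ins for stream messages (not Arrow types):
#   ("dictionary", dict_id, values, is_delta)
#   ("record_batch", dict_id, indices)
DictionaryMsg = Tuple[str, int, Sequence[str], bool]
RecordBatchMsg = Tuple[str, int, Sequence[int]]
Message = Union[DictionaryMsg, RecordBatchMsg]


def decode_stream(messages: Sequence[Message]) -> List[str]:
    """Decode one dictionary-encoded column from a toy message stream."""
    dictionaries: Dict[int, List[str]] = {}
    decoded: List[str] = []
    for msg in messages:
        if msg[0] == "dictionary":
            _, dict_id, values, is_delta = msg
            if is_delta:
                # Delta: concatenate with previous batches for the same id.
                # Assumption: an unseen id starts from an empty dictionary.
                dictionaries.setdefault(dict_id, []).extend(values)
            else:
                # Assumption: a non-delta batch replaces the dictionary.
                dictionaries[dict_id] = list(values)
        else:
            _, dict_id, indices = msg
            dictionary = dictionaries[dict_id]
            # A key used before it is defined raises IndexError here, which
            # mirrors the streaming rule that keys be defined before use.
            decoded.extend(dictionary[i] for i in indices)
    return decoded


stream: List[Message] = [
    ("dictionary", 0, ["A", "B", "C"], False),
    ("record_batch", 0, [0, 1, 2, 1]),
    ("dictionary", 0, ["D", "E"], True),
    ("record_batch", 0, [3, 2, 4, 0]),
]
print(decode_stream(stream))
```

Because the sketch resolves indices as each record batch arrives, it models the streaming format only. The file format relaxes the ordering rule, so a reader of that format would need to collect all dictionary batches before resolving indices.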
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> [Format] Expand Arrow streaming format to permit new dictionaries and deltas
> / additions to existing dictionaries
> -----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-1727
> URL: https://issues.apache.org/jira/browse/ARROW-1727
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Format
> Reporter: Wes McKinney
> Assignee: Brian Hulette
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)