alamb commented on code in PR #5488:
URL: https://github.com/apache/arrow-rs/pull/5488#discussion_r1518836511
##########
arrow-flight/src/encode.rs:
##########
@@ -388,29 +388,39 @@ impl Stream for FlightDataEncoder {
/// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
///
/// [`DictionaryArray`]: arrow_array::DictionaryArray
+///
+/// In the arrow flight protocol dictionary values and keys are sent as two
separate messages.
+/// When a sender is encoding a [`RecordBatch`] containing ['DictionaryArray']
columns, it will
+/// first send a dictionary batch (a batch with header
`MessageHeader::DictionaryBatch`) containing
+/// the dictionary values. The receiver is responsible for reading this batch
and maintaining state that associates
+/// those dictionary values with the corresponding array using the `dict_id`
as a key.
+///
+/// After sending the dictionary batch the sender will send the array data in
a batch with header `MessageHeader::RecordBatch`.
+/// For any dictionary array batches in this message, the encoded flight
message will only contain the dictionary keys. The receiver
+/// is then responsible for rebuilding the `DictionaryArray` on the client
side using the dictionary values from the DictionaryBatch message
+/// and the keys from the RecordBatch message.
+///
+/// For example, if we have a batch with a `TypedDictionaryArray<'_,
UInt32Type, Utf8Type>` (a dictionary array where they keys are `u32` and the
+/// values are `String`), then the DictionaryBatch will contain a
`StringArray` and the RecordBatch will contain a `UInt32Array`.
+///
+/// Not that since `dict_id` defined in the `Schema` is used as a key to
assicated dictionary values to their arrays it is required that each
+/// `DictionaryArray` in a `RecordBatch` have a unique `dict_id`.
#[derive(Debug, PartialEq)]
pub enum DictionaryHandling {
- /// Expands to the underlying type (default). This likely sends more data
- /// over the network but requires less memory (dictionaries are not
tracked)
- /// and is more compatible with other arrow flight client implementations
- /// that may not support `DictionaryEncoding`
- ///
- /// An IPC response, streaming or otherwise, defines its schema up front
- /// which defines the mapping from dictionary IDs. It then sends these
- /// dictionaries over the wire.
+ /// This method should be used if all batches over the entire lifetime fo
the flight stream
Review Comment:
I agree it was mostly historical
The last time I checked (maybe 2 years ago) at least one client (maybe the
golang one) didn't handle dictionaries (so if the rust server sent
DictionaryArray the client couldn't deal with it) which was important for us as
well
All in all, sorting out how to handle Dictionaries better / more rationally
in arrow-flight would be a very nice improvement I think (aka working on
https://github.com/apache/arrow-rs/issues/3389)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]