thinkharderdev commented on code in PR #5488:
URL: https://github.com/apache/arrow-rs/pull/5488#discussion_r1518628430
##########
arrow-flight/src/encode.rs:
##########
@@ -388,29 +388,39 @@ impl Stream for FlightDataEncoder {
/// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
///
/// [`DictionaryArray`]: arrow_array::DictionaryArray
+///
+/// In the arrow flight protocol dictionary values and keys are sent as two
separate messages.
+/// When a sender is encoding a [`RecordBatch`] containing ['DictionaryArray']
columns, it will
+/// first send a dictionary batch (a batch with header
`MessageHeader::DictionaryBatch`) containing
+/// the dictionary values. The receiver is responsible for reading this batch
and maintaining state that associates
+/// those dictionary values with the corresponding array using the `dict_id`
as a key.
+///
+/// After sending the dictionary batch the sender will send the array data in
a batch with header `MessageHeader::RecordBatch`.
+/// For any dictionary array batches in this message, the encoded flight
message will only contain the dictionary keys. The receiver
+/// is then responsible for rebuilding the `DictionaryArray` on the client
side using the dictionary values from the DictionaryBatch message
+/// and the keys from the RecordBatch message.
+///
+/// For example, if we have a batch with a `TypedDictionaryArray<'_,
UInt32Type, Utf8Type>` (a dictionary array where they keys are `u32` and the
+/// values are `String`), then the DictionaryBatch will contain a
`StringArray` and the RecordBatch will contain a `UInt32Array`.
+///
+/// Not that since `dict_id` defined in the `Schema` is used as a key to
assicated dictionary values to their arrays it is required that each
+/// `DictionaryArray` in a `RecordBatch` have a unique `dict_id`.
#[derive(Debug, PartialEq)]
pub enum DictionaryHandling {
- /// Expands to the underlying type (default). This likely sends more data
- /// over the network but requires less memory (dictionaries are not
tracked)
- /// and is more compatible with other arrow flight client implementations
- /// that may not support `DictionaryEncoding`
- ///
- /// An IPC response, streaming or otherwise, defines its schema up front
- /// which defines the mapping from dictionary IDs. It then sends these
- /// dictionaries over the wire.
+ /// This method should be used if all batches over the entire lifetime fo
the flight stream
Review Comment:
Ok yeah, I think this is why I was getting so confused. The original
description of `Hydrate` was correct but because it just doesn't work correctly
for nested dictionary arrays. So I guess we can
1. Fix the issue with nested fields
2. Revert back to original description
But I don't quite understand why `Hydrate` is even an option? It seems
strictly worse than `Resend`? If you wanted to hydrate a dictionary array it
seems like you would do that prior to sending over the flight stream instead of
doing it on the fly in the protocol.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]