Re: [PR] Add more detailed documentation for `arrow_flight::encode::Dictionary` [arrow-rs]

via GitHub Sat, 09 Mar 2024 08:40:09 -0800


alamb commented on code in PR #5488:
URL: https://github.com/apache/arrow-rs/pull/5488#discussion_r1518618821



##########
arrow-flight/src/encode.rs:
##########
@@ -388,29 +388,39 @@ impl Stream for FlightDataEncoder {
 /// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
 ///
 /// [`DictionaryArray`]: arrow_array::DictionaryArray
+///
+/// In the arrow flight protocol dictionary values and keys are sent as two 
separate messages.
+/// When a sender is encoding a [`RecordBatch`] containing ['DictionaryArray'] 
columns, it will
+/// first send a dictionary batch (a batch with header 
`MessageHeader::DictionaryBatch`) containing
+/// the dictionary values. The receiver is responsible for reading this batch 
and maintaining state that associates
+/// those dictionary values with the corresponding array using the `dict_id` 
as a key.
+///
+/// After sending the dictionary batch the sender will send the array data in 
a batch with header `MessageHeader::RecordBatch`.
+/// For any dictionary array batches in this message, the encoded flight 
message will only contain the dictionary keys. The receiver
+/// is then responsible for rebuilding the `DictionaryArray` on the client 
side using the dictionary values from the DictionaryBatch message
+/// and the keys from the RecordBatch message.
+///
+/// For example, if we have a batch with a `TypedDictionaryArray<'_, 
UInt32Type, Utf8Type>` (a dictionary array where they keys are `u32` and the
+/// values are `String`), then the DictionaryBatch will contain a 
`StringArray` and the RecordBatch will contain a `UInt32Array`.
+///
+/// Not that since `dict_id` defined in the `Schema` is used as a key to 
assicated dictionary values to their arrays it is required that each
+/// `DictionaryArray` in a `RecordBatch` have a unique `dict_id`.
 #[derive(Debug, PartialEq)]
 pub enum DictionaryHandling {
-    /// Expands to the underlying type (default). This likely sends more data
-    /// over the network but requires less memory (dictionaries are not 
tracked)
-    /// and is more compatible with other arrow flight client implementations
-    /// that may not support `DictionaryEncoding`
-    ///
-    /// An IPC response, streaming or otherwise, defines its schema up front
-    /// which defines the mapping from dictionary IDs. It then sends these
-    /// dictionaries over the wire.
+    /// This method should be used if all batches over the entire lifetime fo 
the flight stream
+    /// share the same dictionary (as determined by pointer-equality of the 
`DictionaryArray`'s `values` array).
     ///
-    /// This requires identifying the different dictionaries in use, assigning
-    /// them IDs, and sending new dictionaries, delta or otherwise, when needed
+    /// The shared dictionary will be sent only once and the encoder will 
error in the case where it detects
+    /// a dictionary which is not pointer-equal to the initial dictionary.
     ///
-    /// See also:
-    /// * <https://github.com/apache/arrow-rs/issues/1206>
+    /// The arrow flight protocol supports delta dictionaries where the sender 
can send new dictionary values
+    /// incrementally but this is currently not supported in arrow-rs.
     Hydrate,
-    /// Send dictionary FlightData with every RecordBatch that contains a
-    /// [`DictionaryArray`]. See [`Self::Hydrate`] for more tradeoffs. No
-    /// attempt is made to skip sending the same (logical) dictionary values
-    /// twice.
+    /// This method should be used if each batch may contain different 
dictionary values.  This will require more data to be sent

Review Comment:
   I think it would be good to mention the usecase of the client not supporting 
dictionary arrays (which was our original motivation for using this in InfluxDB 
as I recall)
   
   ```suggestion
       /// This method should be used if each batch may contain different 
dictionary values, or if the client
       /// does not support dictionary arrays. Hydrating will require more data 
to be sent
   ```



##########
arrow-flight/src/encode.rs:
##########
@@ -388,29 +388,39 @@ impl Stream for FlightDataEncoder {
 /// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
 ///
 /// [`DictionaryArray`]: arrow_array::DictionaryArray
+///
+/// In the arrow flight protocol dictionary values and keys are sent as two 
separate messages.
+/// When a sender is encoding a [`RecordBatch`] containing ['DictionaryArray'] 
columns, it will
+/// first send a dictionary batch (a batch with header 
`MessageHeader::DictionaryBatch`) containing
+/// the dictionary values. The receiver is responsible for reading this batch 
and maintaining state that associates
+/// those dictionary values with the corresponding array using the `dict_id` 
as a key.
+///
+/// After sending the dictionary batch the sender will send the array data in 
a batch with header `MessageHeader::RecordBatch`.
+/// For any dictionary array batches in this message, the encoded flight 
message will only contain the dictionary keys. The receiver
+/// is then responsible for rebuilding the `DictionaryArray` on the client 
side using the dictionary values from the DictionaryBatch message
+/// and the keys from the RecordBatch message.
+///
+/// For example, if we have a batch with a `TypedDictionaryArray<'_, 
UInt32Type, Utf8Type>` (a dictionary array where they keys are `u32` and the
+/// values are `String`), then the DictionaryBatch will contain a 
`StringArray` and the RecordBatch will contain a `UInt32Array`.
+///
+/// Not that since `dict_id` defined in the `Schema` is used as a key to 
assicated dictionary values to their arrays it is required that each
+/// `DictionaryArray` in a `RecordBatch` have a unique `dict_id`.
 #[derive(Debug, PartialEq)]
 pub enum DictionaryHandling {
-    /// Expands to the underlying type (default). This likely sends more data
-    /// over the network but requires less memory (dictionaries are not 
tracked)
-    /// and is more compatible with other arrow flight client implementations
-    /// that may not support `DictionaryEncoding`
-    ///
-    /// An IPC response, streaming or otherwise, defines its schema up front
-    /// which defines the mapping from dictionary IDs. It then sends these
-    /// dictionaries over the wire.
+    /// This method should be used if all batches over the entire lifetime fo 
the flight stream
+    /// share the same dictionary (as determined by pointer-equality of the 
`DictionaryArray`'s `values` array).
     ///
-    /// This requires identifying the different dictionaries in use, assigning
-    /// them IDs, and sending new dictionaries, delta or otherwise, when needed
+    /// The shared dictionary will be sent only once and the encoder will 
error in the case where it detects
+    /// a dictionary which is not pointer-equal to the initial dictionary.
     ///
-    /// See also:
-    /// * <https://github.com/apache/arrow-rs/issues/1206>
+    /// The arrow flight protocol supports delta dictionaries where the sender 
can send new dictionary values

Review Comment:
   This matches my understanding too



##########
arrow-flight/src/encode.rs:
##########
@@ -388,29 +388,39 @@ impl Stream for FlightDataEncoder {
 /// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
 ///
 /// [`DictionaryArray`]: arrow_array::DictionaryArray
+///
+/// In the arrow flight protocol dictionary values and keys are sent as two 
separate messages.
+/// When a sender is encoding a [`RecordBatch`] containing ['DictionaryArray'] 
columns, it will
+/// first send a dictionary batch (a batch with header 
`MessageHeader::DictionaryBatch`) containing
+/// the dictionary values. The receiver is responsible for reading this batch 
and maintaining state that associates
+/// those dictionary values with the corresponding array using the `dict_id` 
as a key.
+///
+/// After sending the dictionary batch the sender will send the array data in 
a batch with header `MessageHeader::RecordBatch`.
+/// For any dictionary array batches in this message, the encoded flight 
message will only contain the dictionary keys. The receiver
+/// is then responsible for rebuilding the `DictionaryArray` on the client 
side using the dictionary values from the DictionaryBatch message
+/// and the keys from the RecordBatch message.
+///
+/// For example, if we have a batch with a `TypedDictionaryArray<'_, 
UInt32Type, Utf8Type>` (a dictionary array where they keys are `u32` and the
+/// values are `String`), then the DictionaryBatch will contain a 
`StringArray` and the RecordBatch will contain a `UInt32Array`.
+///
+/// Not that since `dict_id` defined in the `Schema` is used as a key to 
assicated dictionary values to their arrays it is required that each
+/// `DictionaryArray` in a `RecordBatch` have a unique `dict_id`.
 #[derive(Debug, PartialEq)]
 pub enum DictionaryHandling {
-    /// Expands to the underlying type (default). This likely sends more data
-    /// over the network but requires less memory (dictionaries are not 
tracked)
-    /// and is more compatible with other arrow flight client implementations
-    /// that may not support `DictionaryEncoding`
-    ///
-    /// An IPC response, streaming or otherwise, defines its schema up front
-    /// which defines the mapping from dictionary IDs. It then sends these
-    /// dictionaries over the wire.
+    /// This method should be used if all batches over the entire lifetime fo 
the flight stream

Review Comment:
   I don't think this description is accurate. The behavior is what is 
"supposed" to happen with Flight, but to the best of my knowledge, this 
behavior isn't implemented -- the gap is tracked in 
https://github.com/apache/arrow-rs/issues/3389 but I don't think it is yet 
implemented
   
   I think what `Hydrate` actually does is to send arrays using the underlying 
type -- for example, `DictionaryArray<Int32, Utf8>` would be cast to 
`StringArray` and the `StringArray` would be sent over the network
   
   I believe `Resend` actually resends a new dictionary for each 
DictionaryArray, without trying to figure out if it was the same as the 
previous dictionary
   
   



##########
arrow-flight/src/encode.rs:
##########
@@ -388,29 +388,39 @@ impl Stream for FlightDataEncoder {
 /// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
 ///
 /// [`DictionaryArray`]: arrow_array::DictionaryArray
+///
+/// In the arrow flight protocol dictionary values and keys are sent as two 
separate messages.
+/// When a sender is encoding a [`RecordBatch`] containing ['DictionaryArray'] 
columns, it will
+/// first send a dictionary batch (a batch with header 
`MessageHeader::DictionaryBatch`) containing
+/// the dictionary values. The receiver is responsible for reading this batch 
and maintaining state that associates
+/// those dictionary values with the corresponding array using the `dict_id` 
as a key.
+///
+/// After sending the dictionary batch the sender will send the array data in 
a batch with header `MessageHeader::RecordBatch`.
+/// For any dictionary array batches in this message, the encoded flight 
message will only contain the dictionary keys. The receiver
+/// is then responsible for rebuilding the `DictionaryArray` on the client 
side using the dictionary values from the DictionaryBatch message
+/// and the keys from the RecordBatch message.
+///
+/// For example, if we have a batch with a `TypedDictionaryArray<'_, 
UInt32Type, Utf8Type>` (a dictionary array where they keys are `u32` and the
+/// values are `String`), then the DictionaryBatch will contain a 
`StringArray` and the RecordBatch will contain a `UInt32Array`.
+///
+/// Not that since `dict_id` defined in the `Schema` is used as a key to 
assicated dictionary values to their arrays it is required that each
+/// `DictionaryArray` in a `RecordBatch` have a unique `dict_id`.
 #[derive(Debug, PartialEq)]
 pub enum DictionaryHandling {
-    /// Expands to the underlying type (default). This likely sends more data
-    /// over the network but requires less memory (dictionaries are not 
tracked)
-    /// and is more compatible with other arrow flight client implementations
-    /// that may not support `DictionaryEncoding`
-    ///
-    /// An IPC response, streaming or otherwise, defines its schema up front
-    /// which defines the mapping from dictionary IDs. It then sends these
-    /// dictionaries over the wire.
+    /// This method should be used if all batches over the entire lifetime fo 
the flight stream
+    /// share the same dictionary (as determined by pointer-equality of the 
`DictionaryArray`'s `values` array).
     ///
-    /// This requires identifying the different dictionaries in use, assigning
-    /// them IDs, and sending new dictionaries, delta or otherwise, when needed
+    /// The shared dictionary will be sent only once and the encoder will 
error in the case where it detects
+    /// a dictionary which is not pointer-equal to the initial dictionary.
     ///
-    /// See also:
-    /// * <https://github.com/apache/arrow-rs/issues/1206>
+    /// The arrow flight protocol supports delta dictionaries where the sender 
can send new dictionary values
+    /// incrementally but this is currently not supported in arrow-rs.
     Hydrate,
-    /// Send dictionary FlightData with every RecordBatch that contains a
-    /// [`DictionaryArray`]. See [`Self::Hydrate`] for more tradeoffs. No
-    /// attempt is made to skip sending the same (logical) dictionary values
-    /// twice.
+    /// This method should be used if each batch may contain different 
dictionary values.  This will require more data to be sent

Review Comment:
   I think it would be good to mention the usecase of the client not supporting 
dictionary arrays (which was our original motivation for using this in InfluxDB 
as I recall)
   
   ```suggestion
       /// This method should be used if each batch may contain different 
dictionary values, or if the client
       /// does not support dictionary arrays. Hydrating will require more data 
to be sent
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Add more detailed documentation for `arrow_flight::encode::Dictionary` [arrow-rs]

Reply via email to