Re: [PR] feat: arrow-ipc delta dictionary support [arrow-rs]

via GitHub Wed, 13 Aug 2025 07:12:50 -0700


albertlockett commented on code in PR #8001:
URL: https://github.com/apache/arrow-rs/pull/8001#discussion_r2273576278



##########
arrow-ipc/src/reader.rs:
##########
@@ -1597,6 +1660,122 @@ impl<R: Read> RecordBatchReader for StreamReader<R> {
     }
 }
 
+/// Representation of a fully parsed IpcMessage from the underlying stream.
+/// Parsing this kind of message is done by higher level constructs such as
+/// [`StreamReader`], because fully interpreting the messages into a record
+/// batch or dictionary batch requires access to stream state such as schema
+/// and the full dictionary cache.
+#[derive(Debug)]
+#[allow(dead_code)]

Review Comment:
   I think I'm probably fine with this for now.
   
   FYI, it looks like the `arrow_schema::Schema` contained in the `Schema` 
variant of this enum is never read. Should it be read in the test? If not we 
could also do something like this:
   ```rs
   pub(crate) enum IpcMessage {
       Schema,
       RecordBatch(RecordBatch),
       #[allow(dead_code)]
       DictionaryBatch {
           id: i64,
           is_delta: bool,
           values: ArrayRef,
       },
   }
   ```
   (and of course, modify the `next_ipc_message` and `run_sequence_test` 
accordingly)
   
   Another option could be to have the `maybe_next` method of `StreamReader` 
actually drive the `MessageReader`, and have a separate driver in the 
`run_sequence_test` function which also drives it and produces these 
`IpcMessage`. If we do that, we could maybe also remove the `RecordBatch` 
contained in the `RecordBatch` variant, since it's never read in the test.
   
   The downside of that of course is we duplicate some of the dict handling 
logic here:
   
https://github.com/apache/arrow-rs/blob/34b4c74a187826628be27d0dfba7503104c4ba39/arrow-ipc/src/reader.rs#L1591-L1612
   ```rs
                   let version = message.version();
                   let dict_values = get_dictionary_values(
                       &body.into(),
                       dict,
                       &self.schema,
                       &mut self.dictionaries_by_id,
                       &version,
                       false,
                       self.skip_validation.clone(),
                   )?;
   
                   update_dictionaries(
                       &mut self.dictionaries_by_id,
                       dict.isDelta(),
                       dict.id(),
                       dict_values.clone(),
                   )?;
   ```



##########
arrow-ipc/src/reader.rs:
##########
@@ -1597,6 +1660,122 @@ impl<R: Read> RecordBatchReader for StreamReader<R> {
     }
 }
 
+/// Representation of a fully parsed IpcMessage from the underlying stream.
+/// Parsing this kind of message is done by higher level constructs such as
+/// [`StreamReader`], because fully interpreting the messages into a record
+/// batch or dictionary batch requires access to stream state such as schema
+/// and the full dictionary cache.
+#[derive(Debug)]
+#[allow(dead_code)]

Review Comment:
   I think I'm probably fine with this for now.
   
   FYI, it looks like the `arrow_schema::Schema` contained in the `Schema` 
variant of this enum is never read. Should it be read in the test? If not we 
could also do something like this:
   ```rs
   pub(crate) enum IpcMessage {
       Schema,
       RecordBatch(RecordBatch),
       #[allow(dead_code)]
       DictionaryBatch {
           id: i64,
           is_delta: bool,
           values: ArrayRef,
       },
   }
   ```
   (and of course, modify the `next_ipc_message` and `run_sequence_test` 
accordingly)
   
   Another option could be to have the `maybe_next` method of `StreamReader` 
actually drive the `MessageReader`, and have a separate driver in the 
`run_sequence_test` function which also drives it and produces these 
`IpcMessage`. If we do that, we could maybe also remove the `RecordBatch` 
contained in the `RecordBatch` variant, since it's never read in the test.
   
   The downside of that of course is we duplicate some of the dict handling 
logic here:
   
https://github.com/apache/arrow-rs/blob/34b4c74a187826628be27d0dfba7503104c4ba39/arrow-ipc/src/reader.rs#L1591-L1612



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] feat: arrow-ipc delta dictionary support [arrow-rs]

Reply via email to