alamb commented on code in PR #9116:
URL: https://github.com/apache/arrow-rs/pull/9116#discussion_r2677562161
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
Review Comment:
It would be nice to prefix the setup of the example with `# ` so it isn't rendered in the docs. So instead of
```rust
/// use tempfile::tempfile;
```
do
```rust
/// # use tempfile::tempfile;
```
That still runs, but will not be shown.
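For instance, the top of the example could hide its setup like this (a sketch using the same imports as the example in this PR; rustdoc still compiles and runs the `#`-prefixed lines, it just omits them from the rendered page):
```rust
/// # use std::sync::Arc;
/// # use tempfile::tempfile;
/// # use arrow_array::{ArrayRef, RecordBatch, StringArray};
/// # use arrow_schema::{DataType, Field, Schema};
/// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
/// use parquet::arrow::ArrowWriter;
```
Keeping the two `parquet` imports visible seems reasonable, since those are the APIs the example is actually demonstrating.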
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
+ /// use tempfile::tempfile;
+ /// use arrow_array::{ArrayRef, RecordBatch, StringArray};
+ /// use arrow_schema::{DataType, Field, Schema};
+ /// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
+ /// use parquet::arrow::ArrowWriter;
+ ///
+ /// // Write a Parquet file with string data
+ /// let file = tempfile().unwrap();
+ /// let schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Utf8, false)
+ /// ]));
+ /// let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris", "Berlin", "Paris"]);
+ /// let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();
+ ///
+ /// let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap();
+ /// writer.write(&batch).unwrap();
+ /// writer.close().unwrap();
+ ///
+ /// // Read the file back, requesting dictionary encoding preservation
+ /// let dict_schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Dictionary(
+ /// Box::new(DataType::Int32),
+ /// Box::new(DataType::Utf8)
+ /// ), false)
+ /// ]));
+ /// let options = ArrowReaderOptions::new().with_schema(dict_schema);
+ /// let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
+ /// file.try_clone().unwrap(),
+ /// options
+ /// ).unwrap();
+ ///
+ /// // Verify the schema shows Dictionary type
+ /// assert!(matches!(
+ /// builder.schema().field(0).data_type(),
+ /// DataType::Dictionary(_, _)
+ /// ));
+ ///
+ /// let mut reader = builder.build().unwrap();
+ /// let batch = reader.next().unwrap().unwrap();
+ ///
+ /// // The column is now a DictionaryArray
+ /// assert!(matches!(
+ /// batch.column(0).data_type(),
+ /// DataType::Dictionary(_, _)
+ /// ));
+ /// ```
+ ///
+ /// **Note**: Dictionary encoding preservation works best when the batch size
+ /// is a divisor of the row group size and a single read does not span multiple
+ /// column chunks. If these conditions are not met, the reader may compute
+ /// a fresh dictionary from the decoded values.
Review Comment:
I think this is somewhat misleading because
1. A single read never spans multiple column chunks
2. The batch size is not related to the dictionary size, as far as I know

Dictionary encoding works best when:
1. The original column was dictionary encoded (the default)
2. There are a small number of distinct values
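For context on point 1, dictionary encoding is controlled by the writer properties and is enabled by default; a minimal sketch of the knob involved (set explicitly here only for illustration, not code from this PR):
```rust
use std::sync::Arc;
use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

let schema = Arc::new(Schema::new(vec![Field::new("city", DataType::Utf8, false)]));
let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris"]);
let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();

// Dictionary encoding is already the default; this just makes it explicit
let props = WriterProperties::builder()
    .set_dictionary_enabled(true)
    .build();

let mut buf = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
```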
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
+ /// use tempfile::tempfile;
+ /// use arrow_array::{ArrayRef, RecordBatch, StringArray};
+ /// use arrow_schema::{DataType, Field, Schema};
+ /// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
+ /// use parquet::arrow::ArrowWriter;
+ ///
+ /// // Write a Parquet file with string data
+ /// let file = tempfile().unwrap();
Review Comment:
I also see that this follows the example above -- I think we could make these examples smaller (and thus easier to follow) if we wrote into an in-memory `Vec` rather than a file
```rust
let mut file = Vec::new();
...
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
...
// now read from the "file"
```
Since what you have here is following the same pattern as the other examples, I think it is good. Maybe we can improve the examples in a follow-on PR.
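Spelled out, that in-memory round trip might look like the following sketch (reading back via `bytes::Bytes`, which implements the `ChunkReader` trait the builder accepts; illustrative only, not the PR's code):
```rust
use std::sync::Arc;
use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

let schema = Arc::new(Schema::new(vec![Field::new("city", DataType::Utf8, false)]));
let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris"]);
let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();

// Write into an in-memory buffer instead of a temp file
let mut file = Vec::new();
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();

// Now read from the "file"
let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(file))
    .unwrap()
    .build()
    .unwrap();
for batch in reader {
    assert_eq!(batch.unwrap().num_rows(), 3);
}
```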
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
Review Comment:
However, given this example follows the others in this file, I think it is fine to keep it this way and we can improve all the examples in a follow-on PR.
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
+ /// use tempfile::tempfile;
+ /// use arrow_array::{ArrayRef, RecordBatch, StringArray};
+ /// use arrow_schema::{DataType, Field, Schema};
+ /// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
+ /// use parquet::arrow::ArrowWriter;
+ ///
+ /// // Write a Parquet file with string data
+ /// let file = tempfile().unwrap();
+ /// let schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Utf8, false)
+ /// ]));
+ /// let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris", "Berlin", "Paris"]);
+ /// let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();
+ ///
+ /// let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap();
+ /// writer.write(&batch).unwrap();
+ /// writer.close().unwrap();
+ ///
+ /// // Read the file back, requesting dictionary encoding preservation
+ /// let dict_schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Dictionary(
+ /// Box::new(DataType::Int32),
+ /// Box::new(DataType::Utf8)
+ /// ), false)
+ /// ]));
+ /// let options = ArrowReaderOptions::new().with_schema(dict_schema);
+ /// let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
+ /// file.try_clone().unwrap(),
+ /// options
+ /// ).unwrap();
+ ///
+ /// // Verify the schema shows Dictionary type
Review Comment:
I think checking the schema type and then the actual record batch type is redundant -- you could leave out this check and make the example more concise.
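Concretely, the tail of the example trimmed along those lines might read (a sketch reusing `file` and `dict_schema` from the example above):
```rust
// Build the reader; the schema assertion can be dropped because the
// batch-level check below already proves the column came back as a
// DictionaryArray
let options = ArrowReaderOptions::new().with_schema(dict_schema);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
    file.try_clone().unwrap(),
    options,
).unwrap();
let mut reader = builder.build().unwrap();
let batch = reader.next().unwrap().unwrap();

// The column is now a DictionaryArray
assert!(matches!(
    batch.column(0).data_type(),
    DataType::Dictionary(_, _)
));
```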