alamb commented on code in PR #9116:
URL: https://github.com/apache/arrow-rs/pull/9116#discussion_r2677562161
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
Review Comment:
It would be nice to prefix the setup of the example with `# ` so it isn't rendered in the docs. So instead of
```rust
/// use tempfile::tempfile;
```
do
```rust
/// # use tempfile::tempfile;
```
That still runs, but will not be shown.
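For instance, the top of the example could hide its setup like this (a sketch using the same imports as the example in this PR; rustdoc still compiles and runs the `#`-prefixed lines, it just omits them from the rendered page):
```rust
/// # use std::sync::Arc;
/// # use tempfile::tempfile;
/// # use arrow_array::{ArrayRef, RecordBatch, StringArray};
/// # use arrow_schema::{DataType, Field, Schema};
/// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
/// use parquet::arrow::ArrowWriter;
```
Keeping the two `parquet` imports visible seems reasonable, since those are the APIs the example is actually demonstrating.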
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
+ /// use tempfile::tempfile;
+ /// use arrow_array::{ArrayRef, RecordBatch, StringArray};
+ /// use arrow_schema::{DataType, Field, Schema};
+ /// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
+ /// use parquet::arrow::ArrowWriter;
+ ///
+ /// // Write a Parquet file with string data
+ /// let file = tempfile().unwrap();
+ /// let schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Utf8, false)
+ /// ]));
+ /// let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris", "Berlin", "Paris"]);
+ /// let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();
+ ///
+ /// let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap();
+ /// writer.write(&batch).unwrap();
+ /// writer.close().unwrap();
+ ///
+ /// // Read the file back, requesting dictionary encoding preservation
+ /// let dict_schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Dictionary(
+ /// Box::new(DataType::Int32),
+ /// Box::new(DataType::Utf8)
+ /// ), false)
+ /// ]));
+ /// let options = ArrowReaderOptions::new().with_schema(dict_schema);
+ /// let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
+ /// file.try_clone().unwrap(),
+ /// options
+ /// ).unwrap();
+ ///
+ /// // Verify the schema shows Dictionary type
+ /// assert!(matches!(
+ /// builder.schema().field(0).data_type(),
+ /// DataType::Dictionary(_, _)
+ /// ));
+ ///
+ /// let mut reader = builder.build().unwrap();
+ /// let batch = reader.next().unwrap().unwrap();
+ ///
+ /// // The column is now a DictionaryArray
+ /// assert!(matches!(
+ /// batch.column(0).data_type(),
+ /// DataType::Dictionary(_, _)
+ /// ));
+ /// ```
+ ///
+ /// **Note**: Dictionary encoding preservation works best when the batch size
+ /// is a divisor of the row group size and a single read does not span multiple
+ /// column chunks. If these conditions are not met, the reader may compute
+ /// a fresh dictionary from the decoded values.
Review Comment:
I think this is somewhat misleading because
1. A single read never spans multiple column chunks
2. The batch size is not related to the dictionary size, as far as I know

Dictionary encoding works best when:
1. The original column was dictionary encoded (the default)
2. There are a small number of distinct values
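For context on point 1, dictionary encoding is controlled by the writer properties and is enabled by default; a minimal sketch of the knob involved (set explicitly here only for illustration, not code from this PR):
```rust
use std::sync::Arc;
use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

let schema = Arc::new(Schema::new(vec![Field::new("city", DataType::Utf8, false)]));
let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris"]);
let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();

// Dictionary encoding is already the default; this just makes it explicit
let props = WriterProperties::builder()
    .set_dictionary_enabled(true)
    .build();

let mut buf = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props)).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
```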
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
+ /// use tempfile::tempfile;
+ /// use arrow_array::{ArrayRef, RecordBatch, StringArray};
+ /// use arrow_schema::{DataType, Field, Schema};
+ /// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
+ /// use parquet::arrow::ArrowWriter;
+ ///
+ /// // Write a Parquet file with string data
+ /// let file = tempfile().unwrap();
Review Comment:
I also see that this follows the example above -- I think we could make these examples smaller (and thus easier to follow) if we wrote into an in-memory `Vec` rather than a file
```rust
let mut file = Vec::new();
...
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
...
// now read from the "file"
```
Since what you have here is following the same pattern as the other examples, I think it is good. Maybe we can improve the examples in a follow-on PR.
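Spelled out, that in-memory round trip might look like the following sketch (reading back via `bytes::Bytes`, which implements the `ChunkReader` trait the builder accepts; illustrative only, not the PR's code):
```rust
use std::sync::Arc;
use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

let schema = Arc::new(Schema::new(vec![Field::new("city", DataType::Utf8, false)]));
let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris"]);
let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();

// Write into an in-memory buffer instead of a temp file
let mut file = Vec::new();
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();

// Now read from the "file"
let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(file))
    .unwrap()
    .build()
    .unwrap();
for batch in reader {
    assert_eq!(batch.unwrap().num_rows(), 3);
}
```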
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
Review Comment:
However, given this example follows the others in this file, I think it is fine to keep it this way and we can improve all the examples in a follow-on PR.
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -508,6 +508,66 @@ impl ArrowReaderOptions {
/// let mut reader = builder.build().unwrap();
/// let _batch = reader.next().unwrap().unwrap();
/// ```
+ ///
+ /// # Example: Preserving Dictionary Encoding
+ ///
+ /// By default, Parquet string columns are read as `Utf8Array` (or `LargeUtf8Array`),
+ /// even if the underlying Parquet data uses dictionary encoding. You can preserve
+ /// the dictionary encoding by specifying a `Dictionary` type in the schema hint:
+ ///
+ /// ```
+ /// use std::sync::Arc;
+ /// use tempfile::tempfile;
+ /// use arrow_array::{ArrayRef, RecordBatch, StringArray};
+ /// use arrow_schema::{DataType, Field, Schema};
+ /// use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
+ /// use parquet::arrow::ArrowWriter;
+ ///
+ /// // Write a Parquet file with string data
+ /// let file = tempfile().unwrap();
+ /// let schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Utf8, false)
+ /// ]));
+ /// let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris", "Berlin", "Paris"]);
+ /// let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();
+ ///
+ /// let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap();
+ /// writer.write(&batch).unwrap();
+ /// writer.close().unwrap();
+ ///
+ /// // Read the file back, requesting dictionary encoding preservation
+ /// let dict_schema = Arc::new(Schema::new(vec![
+ /// Field::new("city", DataType::Dictionary(
+ /// Box::new(DataType::Int32),
+ /// Box::new(DataType::Utf8)
+ /// ), false)
+ /// ]));
+ /// let options = ArrowReaderOptions::new().with_schema(dict_schema);
+ /// let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
+ /// file.try_clone().unwrap(),
+ /// options
+ /// ).unwrap();
+ ///
+ /// // Verify the schema shows Dictionary type
Review Comment:
I think checking the schema type and then the actual record batch type is redundant -- you could leave out this check and make the example more concise.
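Concretely, the tail of the example trimmed along those lines might read (a sketch reusing `file` and `dict_schema` from the example above):
```rust
// Build the reader; the schema assertion can be dropped because the
// batch-level check below already proves the column came back as a
// DictionaryArray
let options = ArrowReaderOptions::new().with_schema(dict_schema);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
    file.try_clone().unwrap(),
    options,
).unwrap();
let mut reader = builder.build().unwrap();
let batch = reader.next().unwrap().unwrap();

// The column is now a DictionaryArray
assert!(matches!(
    batch.column(0).data_type(),
    DataType::Dictionary(_, _)
));
```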