vegarsti commented on issue #8016:
URL: https://github.com/apache/arrow-rs/issues/8016#issuecomment-3232029296

> > In [#8069](https://github.com/apache/arrow-rs/pull/8069) we can now write RunArray data to Parquet, but it writes it as plain data, not dictionary data. I basically copied what is done for the Dictionary type wherever I could find a `match` clause. I'm guessing what's missing is (at least) some way to transform a RunArray into a DictionaryArray before writing it to Parquet? 🤔
>
> I have a feeling that it might still be using dictionary encoding, but that it would be transparent if you're verifying by reading the parquet file back to arrow, as is done here:
>
> [parquet/src/arrow/arrow_writer/mod.rs, line 4332 at 0931768](https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/arrow/arrow_writer/mod.rs#L4332):
>
> ```rust
> // Schema of output is plain, not dictionary or REE encoded!!
> ```
>
> The reason for this is that, when we create the column encoder, we check whether the type supports dictionary encoding here:
>
> [parquet/src/column/writer/encoder.rs, lines 185 to 187 at 0931768](https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/column/writer/encoder.rs#L185-L187):
>
> ```rust
> let dict_supported = props.dictionary_enabled(descr.path())
>     && has_dictionary_support(T::get_physical_type(), props);
> let dict_encoder = dict_supported.then(|| DictEncoder::new(descr.clone()));
> ```
>
> You can verify this using `parquet-tools`. For example, if I did something like this:
>
> ```rust
> // Imports assumed for this sketch; the writer calls are async, so this
> // runs inside an async context.
> use std::sync::Arc;
>
> use arrow_array::{Int32Array, RecordBatch, RunArray, UInt32Array};
> use arrow_schema::{DataType, Field, Schema};
> use object_store::local::LocalFileSystem;
> use parquet::arrow::arrow_writer::ArrowWriterOptions;
> use parquet::arrow::async_writer::{AsyncArrowWriter, ParquetObjectWriter};
>
> // A single run of 5 nulls, run-end encoded.
> let run_ends = Int32Array::from_iter_values([5]);
> let all_nulls = UInt32Array::from_iter([None]);
> let run_arr = RunArray::try_new(&run_ends, &all_nulls).unwrap();
>
> let record_batch = RecordBatch::try_new(
>     Arc::new(Schema::new(vec![Field::new(
>         "b",
>         DataType::RunEndEncoded(
>             Arc::new(Field::new("run_ends", DataType::Int32, false)),
>             Arc::new(Field::new("values", DataType::UInt32, true)),
>         ),
>         true,
>     )])),
>     vec![Arc::new(run_arr)],
> )
> .unwrap();
>
> let object_store = LocalFileSystem::new_with_prefix("/tmp").unwrap();
> let object_writer =
>     ParquetObjectWriter::new(Arc::new(object_store), "albert.parquet".into());
> let mut arrow_writer = AsyncArrowWriter::try_new_with_options(
>     object_writer,
>     record_batch.schema(),
>     ArrowWriterOptions::new().with_skip_arrow_metadata(false),
> )
> .unwrap();
> // Write the batch and close the writer so the file footer is flushed.
> arrow_writer.write(&record_batch).await.unwrap();
> arrow_writer.close().await.unwrap();
> ```
>
> I'd expect to see something like this:
>
> ```json
> {
>   "NumRowGroups": 1,
>   "RowGroups": [
>     {
>       "NumRows": 5,
>       "TotalByteSize": 38,
>       "Columns": [
>         {
>           "PathInSchema": ["b"],
>           "Type": "INT32",
>           "Encodings": ["PLAIN", "RLE", "RLE_DICTIONARY"],
>           "CompressedSize": 38,
>           "UncompressedSize": 38,
>           "NumValues": 5,
>           "NullCount": 5,
>           "CompressionCodec": "UNCOMPRESSED"
>         }
>       ]
>     }
>   ]
> }
> ```

Amazing, thank you!
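A side note on the `dictionary_enabled(descr.path())` check quoted above: it consults the writer's `WriterProperties`, so dictionary encoding can be toggled globally or per column when constructing the writer. A minimal sketch, not code from the thread (the column name `"b"` is borrowed from the example above):

```rust
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

// Dictionary encoding is enabled by default; disable it for column "b" only,
// which makes `props.dictionary_enabled(descr.path())` return false there.
let props = WriterProperties::builder()
    .set_column_dictionary_enabled(ColumnPath::from("b"), false)
    .build();
// Hand the properties to the writer via
// ArrowWriterOptions::new().with_properties(props).
```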
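If `parquet-tools` isn't at hand, the same check can be done from Rust by reading the footer metadata back. A minimal sketch, assuming the file written above ends up at `/tmp/albert.parquet`:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

let file = File::open("/tmp/albert.parquet").unwrap();
let reader = SerializedFileReader::new(file).unwrap();
for row_group in reader.metadata().row_groups() {
    for column in row_group.columns() {
        // An `RLE_DICTIONARY` entry here confirms the column chunk was
        // written with dictionary encoding, even though reading the file
        // back to arrow yields a plain (non-dictionary) schema.
        println!("{}: {:?}", column.column_path(), column.encodings());
    }
}
```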
> > In [#8069](https://github.com/apache/arrow-rs/pull/8069) now we can write RunArray data to Parquet, but it writes it as plain data, not dictionary data. I basically copied what is done for the Dictionary type where I could find a `match` clause. I'm guessing what's missing is (at least) some way to transform a RunArray to a DictionaryArray before writing it to Parquet? 🤔 > > I have a feeling that it might still be using dictionary encoding, but it would be transparent if you're verifying by reading back the parquet file to arrow like is done here: > > [arrow-rs/parquet/src/arrow/arrow_writer/mod.rs](https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/arrow/arrow_writer/mod.rs#L4332) > > Line 4332 in [0931768](/apache/arrow-rs/commit/09317688974ee757f0ca18d80bcec12cf32f76d2) > > // Schema of output is plain, not dictionary or REE encoded!! > The reason for this is that, when we create the column encoder, we check if the type supports dictionary encoding here: > > [arrow-rs/parquet/src/column/writer/encoder.rs](https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/column/writer/encoder.rs#L185-L187) > > Lines 185 to 187 in [0931768](/apache/arrow-rs/commit/09317688974ee757f0ca18d80bcec12cf32f76d2) > > let dict_supported = props.dictionary_enabled(descr.path()) > && has_dictionary_support(T::get_physical_type(), props); > let dict_encoder = dict_supported.then(|| DictEncoder::new(descr.clone())); > You can verify this using `parquet-tools`. For example, if I did something like this > > let run_ends = Int32Array::from_iter_values([5]); > let all_nulls = UInt32Array::from_iter([None]); > let run_arr = RunArray::try_new(&run_ends, &all_nulls).unwrap(); > > > let record_batch = RecordBatch::try_new( > Arc::new(Schema::new(vec![ > Field::new( > "b", > DataType::RunEndEncoded( > Arc::new(Field::new("run_ends", DataType::Int32, false)), > Arc::new(Field::new("values", DataType::UInt32, true)), > ), > true, > ) > ])), > vec![ > Arc::new(run_arr) > ] > ).unwrap(); > > let object_store = LocalFileSystem::new_with_prefix("/tmp").unwrap(); > let object_writer = ParquetObjectWriter::new(Arc::new(object_store), "albert.parquet".into()); > let mut arrow_writer = AsyncArrowWriter::try_new_with_options( > object_writer, > record_batch.schema(), > ArrowWriterOptions::new().with_skip_arrow_metadata(false) > ).unwrap(); > I'd expect to see something like this: > > { > "NumRowGroups": 1, > "RowGroups": [ > { > "NumRows": 5, > "TotalByteSize": 38, > "Columns": [ > { > "PathInSchema": [ > "b" > ], > "Type": "INT32", > "Encodings": [ > "PLAIN", > "RLE", > "RLE_DICTIONARY" > ], > "CompressedSize": 38, > "UncompressedSize": 38, > "NumValues": 5, > "NullCount": 5, > "CompressionCodec": "UNCOMPRESSED" > } > ] > } > ] > } Amazing, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org