vegarsti commented on issue #8016:
URL: https://github.com/apache/arrow-rs/issues/8016#issuecomment-3232029296

   > > In [#8069](https://github.com/apache/arrow-rs/pull/8069) we can now write RunArray data to Parquet, but it writes it as plain data, not dictionary data. I basically copied what is done for the Dictionary type where I could find a `match` clause. I'm guessing what's missing is (at least) some way to transform a RunArray to a DictionaryArray before writing it to Parquet? 🤔
   > 
   > I have a feeling that it might still be using dictionary encoding, but it would be transparent if you're verifying by reading back the parquet file to arrow, as is done here:
   > 
   > [arrow-rs/parquet/src/arrow/arrow_writer/mod.rs, line 4332](https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/arrow/arrow_writer/mod.rs#L4332) at commit [0931768](https://github.com/apache/arrow-rs/commit/09317688974ee757f0ca18d80bcec12cf32f76d2):
   > 
   >         // Schema of output is plain, not dictionary or REE encoded!!
   > The reason for this is that, when we create the column encoder, we check if the type supports dictionary encoding here:
   > 
   > [arrow-rs/parquet/src/column/writer/encoder.rs, lines 185 to 187](https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/column/writer/encoder.rs#L185-L187) at commit [0931768](https://github.com/apache/arrow-rs/commit/09317688974ee757f0ca18d80bcec12cf32f76d2):
   > 
   >         let dict_supported = props.dictionary_enabled(descr.path())
   >             && has_dictionary_support(T::get_physical_type(), props);
   >         let dict_encoder = dict_supported.then(|| DictEncoder::new(descr.clone()));
   > You can verify this using `parquet-tools`. For example, if I did something like this:
   > 
   >         let run_ends = Int32Array::from_iter_values([5]);
   >         let all_nulls = UInt32Array::from_iter([None]);
   >         let run_arr = RunArray::try_new(&run_ends, &all_nulls).unwrap();
   > 
   >         let record_batch = RecordBatch::try_new(
   >             Arc::new(Schema::new(vec![
   >                 Field::new(
   >                     "b",
   >                     DataType::RunEndEncoded(
   >                         Arc::new(Field::new("run_ends", DataType::Int32, false)),
   >                         Arc::new(Field::new("values", DataType::UInt32, true)),
   >                     ),
   >                     true,
   >                 )
   >             ])),
   >             vec![
   >                 Arc::new(run_arr)
   >             ]
   >         ).unwrap();
   > 
   >         let object_store = LocalFileSystem::new_with_prefix("/tmp").unwrap();
   >         let object_writer = ParquetObjectWriter::new(Arc::new(object_store), "albert.parquet".into());
   >         let mut arrow_writer = AsyncArrowWriter::try_new_with_options(
   >             object_writer,
   >             record_batch.schema(),
   >             ArrowWriterOptions::new().with_skip_arrow_metadata(false)
   >         ).unwrap();
   > 
   >         // Write the batch and close the writer so the file is actually flushed
   >         arrow_writer.write(&record_batch).await.unwrap();
   >         arrow_writer.close().await.unwrap();
   > I'd expect to see something like this:
   > 
   >         {
   >           "NumRowGroups": 1,
   >           "RowGroups": [
   >             {
   >               "NumRows": 5,
   >               "TotalByteSize": 38,
   >               "Columns": [
   >                 {
   >                   "PathInSchema": [
   >                     "b"
   >                   ],
   >                   "Type": "INT32",
   >                   "Encodings": [
   >                     "PLAIN",
   >                     "RLE",
   >                     "RLE_DICTIONARY"
   >                   ],
   >                   "CompressedSize": 38,
   >                   "UncompressedSize": 38,
   >                   "NumValues": 5,
   >                   "NullCount": 5,
   >                   "CompressionCodec": "UNCOMPRESSED"
   >                 }
   >               ]
   >             }
   >           ]
   >         }
   
   Amazing, thank you!
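
   For anyone else following along: the `props.dictionary_enabled(descr.path())` check quoted above is driven by `WriterProperties`, so whether dictionary encoding is attempted can be controlled globally or per column when constructing the writer. A minimal sketch (the column name `"b"` just matches the example above; as far as I know this is the standard `parquet` crate builder API):

       use parquet::file::properties::WriterProperties;
       use parquet::schema::types::ColumnPath;

       // Dictionary encoding is on by default; turn it off globally,
       // then opt column "b" back in.
       let props = WriterProperties::builder()
           .set_dictionary_enabled(false)
           .set_column_dictionary_enabled(ColumnPath::from("b"), true)
           .build();

   Those properties can then be passed to the writer, e.g. via `AsyncArrowWriter::try_new(object_writer, schema, Some(props))`.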


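   Also, instead of shelling out to `parquet-tools`, the recorded encodings can be checked programmatically with the `parquet` crate's sync reader. A small sketch, assuming the file was written to `/tmp/albert.parquet` as in the example above:

       use std::fs::File;
       use parquet::file::reader::{FileReader, SerializedFileReader};

       // Print the encodings recorded for every column chunk in every row group.
       let file = File::open("/tmp/albert.parquet").unwrap();
       let reader = SerializedFileReader::new(file).unwrap();
       for row_group in reader.metadata().row_groups() {
           for column in row_group.columns() {
               println!("{}: {:?}", column.column_path(), column.encodings());
           }
       }

   Seeing `RLE_DICTIONARY` in that output confirms the column was dictionary encoded on disk, even though the schema read back into arrow is plain.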