albertlockett commented on issue #8016:
URL: https://github.com/apache/arrow-rs/issues/8016#issuecomment-3229423961

> In [#8069](https://github.com/apache/arrow-rs/pull/8069) now we can write RunArray data to Parquet, but it writes it as plain data, not dictionary data. I basically copied what is done for the Dictionary type where I could find a `match` clause. I'm guessing what's missing is (at least) some way to transform a RunArray to a DictionaryArray before writing it to Parquet? 🤔
   
I have a feeling it is still using dictionary encoding, but that would be transparent if you verify by reading the Parquet file back into Arrow, as is done here:

https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/arrow/arrow_writer/mod.rs#L4332
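
Reading the file back through the Arrow reader only gives you the decoded values; nothing in the resulting `RecordBatch` tells you which Parquet encoding was used on disk. A minimal sketch of what I mean (assuming the `/tmp/albert.parquet` file produced by the snippet further down):

```rs
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let file = File::open("/tmp/albert.parquet").unwrap();
let reader = ParquetRecordBatchReaderBuilder::try_new(file)
    .unwrap()
    .build()
    .unwrap();

for batch in reader {
    // the values round-trip correctly, but the RecordBatch carries no
    // information about the on-disk encoding of the column
    println!("{:?}", batch.unwrap());
}
```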
   
The reason is that when we create the column encoder, we check whether the type supports dictionary encoding here:

https://github.com/apache/arrow-rs/blob/09317688974ee757f0ca18d80bcec12cf32f76d2/parquet/src/column/writer/encoder.rs#L185-L187
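
That check consults the writer properties together with the column's physical type, and dictionary encoding is on by default. As a quick, self-contained illustration of the properties side of that check (not the encoder internals):

```rs
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

// dictionary encoding is enabled by default for every column
let props = WriterProperties::builder().build();
assert!(props.dictionary_enabled(&ColumnPath::from("b")));

// it can be disabled per column, which would drop RLE_DICTIONARY
// from the encodings reported in the file metadata
let props = WriterProperties::builder()
    .set_column_dictionary_enabled(ColumnPath::from("b"), false)
    .build();
assert!(!props.dictionary_enabled(&ColumnPath::from("b")));
```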
   
You can verify this using `parquet-tools`. For example, if I write a file like this (I've added the imports plus the `write`/`close` calls the original snippet was missing, so the file actually gets flushed to disk):

```rs
use std::sync::Arc;

use arrow_array::{Int32Array, RecordBatch, RunArray, UInt32Array};
use arrow_schema::{DataType, Field, Schema};
use object_store::local::LocalFileSystem;
use parquet::arrow::arrow_writer::ArrowWriterOptions;
use parquet::arrow::async_writer::{AsyncArrowWriter, ParquetObjectWriter};

// a single run of five nulls
let run_ends = Int32Array::from_iter_values([5]);
let all_nulls = UInt32Array::from_iter([None]);
let run_arr = RunArray::try_new(&run_ends, &all_nulls).unwrap();

let record_batch = RecordBatch::try_new(
    Arc::new(Schema::new(vec![Field::new(
        "b",
        DataType::RunEndEncoded(
            Arc::new(Field::new("run_ends", DataType::Int32, false)),
            Arc::new(Field::new("values", DataType::UInt32, true)),
        ),
        true,
    )])),
    vec![Arc::new(run_arr)],
)
.unwrap();

let object_store = LocalFileSystem::new_with_prefix("/tmp").unwrap();
let object_writer =
    ParquetObjectWriter::new(Arc::new(object_store), "albert.parquet".into());
let mut arrow_writer = AsyncArrowWriter::try_new_with_options(
    object_writer,
    record_batch.schema(),
    ArrowWriterOptions::new().with_skip_arrow_metadata(false),
)
.unwrap();

// write the batch and close the writer so the footer is written
arrow_writer.write(&record_batch).await.unwrap();
arrow_writer.close().await.unwrap();
```
   
then inspecting the file with `parquet-tools` (e.g. `parquet-tools meta /tmp/albert.parquet`; the exact subcommand depends on which build of the tool you have), I'd expect to see something like this:

```json
{
  "NumRowGroups": 1,
  "RowGroups": [
    {
      "NumRows": 5,
      "TotalByteSize": 38,
      "Columns": [
        {
          "PathInSchema": [
            "b"
          ],
          "Type": "INT32",
          "Encodings": [
            "PLAIN",
            "RLE",
            "RLE_DICTIONARY"
          ],
          "CompressedSize": 38,
          "UncompressedSize": 38,
          "NumValues": 5,
          "NullCount": 5,
          "CompressionCodec": "UNCOMPRESSED"
        }
      ]
    }
  ]
}
```
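
The `RLE_DICTIONARY` entry in `Encodings` is the tell that dictionary encoding was used. If you'd rather check from Rust than shell out to `parquet-tools`, a minimal sketch that reads the footer and prints the encodings recorded for each column chunk (again assuming the `/tmp/albert.parquet` file from above):

```rs
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

let file = File::open("/tmp/albert.parquet").unwrap();
let reader = SerializedFileReader::new(file).unwrap();

for rg in reader.metadata().row_groups() {
    for col in rg.columns() {
        // seeing RLE_DICTIONARY here means the data pages for this
        // column chunk were written with dictionary encoding
        println!("{}: {:?}", col.column_path(), col.encodings());
    }
}
```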
   

