tmcw opened a new issue, #4804:
URL: https://github.com/apache/arrow-rs/issues/4804

   **Which part is this question about**
   
   The arrow-rs implementation
   
   **Describe your question**
   
   I've been trying to use arrow-rs to encode a largish dataset - about 2GB of 
gzipped JSON that becomes roughly a 200MB Parquet file. The data is a 
time-series capacity dataset, so it should be very amenable to delta encoding. 
But the encoding options I set don't seem to make any difference to the output 
size. Maybe I'm connecting the pieces incorrectly?
   
   Here's the most minimal example I've been able to cook up:
   
   Cargo.toml:
   
   ```toml
   [package]
   name = "parquet-demo"
   version = "0.1.0"
   edition = "2021"
   
    # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
   
   [dependencies]
   arrow = "46.0.0"
   arrow-array = "46.0.0"
   parquet = "46.0.0"
   ```
   
   /src/main.rs:
   
   ```rs
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow_array::{builder::PrimitiveBuilder, types::Int32Type, ArrayRef, RecordBatch};
    use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
    use std::{fs, path::Path, sync::Arc};

    fn main() {
        let path = Path::new("sample.parquet");

        let numbers = Field::new("numbers", DataType::Int32, false);
        let schema = Schema::new(vec![numbers]);
        let file = fs::File::create(&path).unwrap();

        let props = WriterProperties::builder()
            .set_encoding(parquet::basic::Encoding::DELTA_BINARY_PACKED)
            .set_compression(parquet::basic::Compression::UNCOMPRESSED)
            .set_writer_version(parquet::file::properties::WriterVersion::PARQUET_2_0);

        let mut writer = ArrowWriter::try_new(file, schema.into(), Some(props.build())).unwrap();

        // 10,000 runs of 10,000 identical values each: 0…0, 1…1, 2…2, …
        let mut numbers = PrimitiveBuilder::<Int32Type>::new();
        for j in 0..10000 {
            for _i in 0..10000 {
                numbers.append_value(j);
            }
        }

        let batch =
            RecordBatch::try_from_iter(vec![("numbers", Arc::new(numbers.finish()) as ArrayRef)])
                .unwrap();

        writer.write(&batch).expect("Writing batch");
        writer.close().unwrap();
    }
   ```
   
   Running this produces:
   
   ```
   ➜  parquet-demo git:(main) ✗ cargo run && du -sh sample.parquet
      Compiling parquet-demo v0.1.0 (/Users/tmcw/s/parquet-demo)
       Finished dev [unoptimized + debuginfo] target(s) in 0.41s
        Running `target/debug/parquet-demo`
   136K    sample.parquet
   ```
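
   For scale, that's 10,000 × 10,000 = 100 million Int32 values - roughly 400MB 
of raw data - so something is already packing the data down hard regardless of 
which encoding I ask for.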
   
   So that's with DELTA_BINARY_PACKED encoding, and the dataset is lots of 
consecutive identical values - 0000000111111122222, that kind of thing - which 
should be very amenable to delta encoding or RLE. Trying PLAIN encoding, with 
the rest of the program unchanged:
   
   ```rs
    // Identical to the program above, except for the default encoding:
    let props = WriterProperties::builder()
        .set_encoding(parquet::basic::Encoding::PLAIN)
        .set_compression(parquet::basic::Compression::UNCOMPRESSED)
        .set_writer_version(parquet::file::properties::WriterVersion::PARQUET_2_0);
   ```
   
   ```
   ➜  parquet-demo git:(main) ✗ cargo run && du -sh sample.parquet
      Compiling parquet-demo v0.1.0 (/Users/tmcw/s/parquet-demo)
       Finished dev [unoptimized + debuginfo] target(s) in 0.39s
        Running `target/debug/parquet-demo`
   136K    sample.parquet
   ```
   
   Exactly the same size. If I swap PLAIN out for RLE, or any other encoding 
value, the file is always the same size - and the same holds for the much 
larger dataset I'm actually trying to encode.
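
   In case it helps with diagnosis, here's a quick way to see which encodings 
actually end up in the file (assuming I'm reading the parquet metadata API 
correctly - the file name just matches the example above):

    ```rs
    use parquet::file::reader::{FileReader, SerializedFileReader};
    use std::fs::File;

    fn main() {
        // Print, per column chunk, which encodings and compression the writer
        // actually recorded in the footer of sample.parquet.
        let file = File::open("sample.parquet").unwrap();
        let reader = SerializedFileReader::new(file).unwrap();
        for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
            for col in rg.columns() {
                println!(
                    "row group {i}, column {}: encodings {:?}, compression {:?}",
                    col.column_path(),
                    col.encodings(),
                    col.compression()
                );
            }
        }
    }
    ```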
   
   I'm totally new to this domain, so it could easily be that I'm using it 
wrong! I've also tried `.set_column_encoding` with the same result. I don't 
know what's going wrong - any ideas? Thanks!
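
   For completeness, the `.set_column_encoding` attempt looked roughly like 
this (the column path string is just my guess at how to address the single 
`numbers` column; the rest of the program was unchanged):

    ```rs
    use parquet::schema::types::ColumnPath;

    // Same writer setup as above, but targeting the "numbers" column directly
    // rather than setting a default encoding for every column.
    let props = WriterProperties::builder()
        .set_column_encoding(
            ColumnPath::from("numbers"),
            parquet::basic::Encoding::DELTA_BINARY_PACKED,
        )
        .set_compression(parquet::basic::Compression::UNCOMPRESSED)
        .set_writer_version(parquet::file::properties::WriterVersion::PARQUET_2_0);
    ```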

