tmcw opened a new issue, #4804: URL: https://github.com/apache/arrow-rs/issues/4804
**Which part is this question about**

The arrow-rs implementation

**Describe your question**

I've been trying to use arrow-rs to encode a large-ish dataset - about 2GB of gzipped JSON that becomes a roughly 200MB Parquet file. The data is very amenable to delta encoding - it's a time-series capacity dataset. But the encoding options provided don't seem to make any difference to the output size. Maybe I'm connecting the pieces incorrectly? Here's the most minimal example I've been able to cook up:

Cargo.toml:

```toml
[package]
name = "parquet-demo"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
arrow = "46.0.0"
arrow-array = "46.0.0"
parquet = "46.0.0"
```

src/main.rs:

```rs
use arrow::datatypes::{DataType, Field, Schema};
use arrow_array::{builder::PrimitiveBuilder, types::Int32Type, ArrayRef, RecordBatch};
use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
use std::{fs, path::Path, sync::Arc};

fn main() {
    let path = Path::new("sample.parquet");

    let numbers = Field::new("numbers", DataType::Int32, false);
    let schema = Schema::new(vec![numbers]);

    let file = fs::File::create(&path).unwrap();

    let props = WriterProperties::builder()
        .set_encoding(parquet::basic::Encoding::DELTA_BINARY_PACKED)
        .set_compression(parquet::basic::Compression::UNCOMPRESSED)
        .set_writer_version(parquet::file::properties::WriterVersion::PARQUET_2_0);

    let mut writer = ArrowWriter::try_new(file, schema.into(), Some(props.build())).unwrap();

    // 10,000 runs of 10,000 identical values each: 0...0 1...1 2...2 ...
    let mut numbers = PrimitiveBuilder::<Int32Type>::new();
    for j in 0..10000 {
        for _i in 0..10000 {
            numbers.append_value(j);
        }
    }

    let batch =
        RecordBatch::try_from_iter(vec![("numbers", Arc::new(numbers.finish()) as ArrayRef)])
            .unwrap();

    writer.write(&batch).expect("Writing batch");
    writer.close().unwrap();
}
```

Running this produces:

```
➜ parquet-demo git:(main) ✗ cargo run && du -sh sample.parquet
   Compiling parquet-demo v0.1.0 (/Users/tmcw/s/parquet-demo)
    Finished dev [unoptimized + debuginfo] target(s) in 0.41s
     Running `target/debug/parquet-demo`
136K	sample.parquet
```

So that's with DELTA_BINARY_PACKED encoding, and the dataset is long runs of consecutive identical values - 0000000111111122222, that kind of thing. It should be very amenable to delta encoding or RLE.
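To see which encodings actually land in the file, the footer can be read back - the sketch below uses the same parquet crate's reader API (the file name matches the example above; I haven't verified this exact snippet):

```rs
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs;

fn main() {
    // Read the footer of the file written above and print, for each
    // column chunk, the encodings the writer actually recorded.
    let file = fs::File::open("sample.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    for row_group in reader.metadata().row_groups() {
        for column in row_group.columns() {
            println!("{}: {:?}", column.column_path(), column.encodings());
        }
    }
}
```

If the property took effect, I'd expect DELTA_BINARY_PACKED to show up in that list.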
Trying PLAIN encoding - the same program with only the encoding line changed:

```rs
        .set_encoding(parquet::basic::Encoding::PLAIN)
```

```
➜ parquet-demo git:(main) ✗ cargo run && du -sh sample.parquet
   Compiling parquet-demo v0.1.0 (/Users/tmcw/s/parquet-demo)
    Finished dev [unoptimized + debuginfo] target(s) in 0.39s
     Running `target/debug/parquet-demo`
136K	sample.parquet
```

The exact same size. The same happens if I swap PLAIN out for RLE or any other encoding value: the output is always the same size. The same goes for the much larger dataset I'm actually trying to encode.

I'm totally new to this domain, so this could easily be me using it wrong! I've also tried `.set_column_encoding`, with the same result; that attempt is sketched below. I don't know what's going wrong - any ideas? Thanks!
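For reference, the `.set_column_encoding` attempt looked roughly like this (a sketch rather than the verbatim code - `"numbers"` is the column name from the example above, and the rest of the program is unchanged):

```rs
use parquet::{
    basic::{Compression, Encoding},
    file::properties::{WriterProperties, WriterVersion},
    schema::types::ColumnPath,
};

fn main() {
    // Target the "numbers" column by path instead of setting a
    // writer-wide default encoding; everything else is as in the
    // full example above.
    let props = WriterProperties::builder()
        .set_column_encoding(ColumnPath::from("numbers"), Encoding::DELTA_BINARY_PACKED)
        .set_compression(Compression::UNCOMPRESSED)
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    // ...then passed to ArrowWriter::try_new exactly as before.
    let _ = props;
}
```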
