jdroenner opened a new issue, #4724:
URL: https://github.com/apache/arrow-rs/issues/4724
(Don't really know if this is a bug or a feature)
**Describe the bug**
We have a use-case where we want to store Arrow data (`StructArray`)
compressed in memory. Currently the best way to do this appears to be writing
with the IPC `FileWriter` with compression enabled and then storing the
compressed bytes. For our use-case it is important that the data, once
decompressed, occupies the same amount of memory as the original input.

Using the `FileReader`, we noticed that this assumption does not hold: in
some cases the memory footprint of the decompressed data becomes roughly
"decompressed bytes + compressed bytes".
If `CompressionCodec::decompress_to_buffer` detects an uncompressed slice of
data (`decompressed_length == LENGTH_NO_COMPRESSED_DATA`), it returns a new
`Buffer` that points into the `Buffer` holding the raw bytes produced by the
reader. This keeps the entire raw `Buffer` alive in memory. The compressed
data slices, on the other hand, are decompressed into new `Buffer`s backed by
freshly allocated memory.
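The retention effect can be illustrated without Arrow at all: a zero-copy slice that shares its parent allocation keeps the whole parent alive. Below is a minimal sketch using `Arc<Vec<u8>>` as a stand-in for Arrow's `Buffer` (the type and method names here are illustrative, not the arrow-rs API):

```rust
use std::sync::Arc;

/// Minimal stand-in for a zero-copy slice of a shared buffer,
/// analogous in spirit to slicing a `Buffer` in arrow-rs.
struct SharedSlice {
    parent: Arc<Vec<u8>>, // keeps the *entire* parent allocation alive
    offset: usize,
    len: usize,
}

impl SharedSlice {
    fn as_bytes(&self) -> &[u8] {
        &self.parent[self.offset..self.offset + self.len]
    }
    /// Memory actually retained: the whole parent, not just the slice.
    fn retained_bytes(&self) -> usize {
        self.parent.len()
    }
}

fn main() {
    // A "raw" IPC body: 1000 bytes, of which only 10 belong to the
    // uncompressed block we want to expose.
    let raw = Arc::new(vec![0u8; 1000]);
    let slice = SharedSlice { parent: Arc::clone(&raw), offset: 8, len: 10 };
    drop(raw); // the reader's own handle goes away...
    // ...but the slice still pins all 1000 bytes.
    assert_eq!(slice.as_bytes().len(), 10);
    assert_eq!(slice.retained_bytes(), 1000);
}
```

This is why the observed footprint is roughly "decompressed bytes + compressed bytes": each zero-copy view into the raw IPC body pins the full raw allocation for as long as any of the views is alive.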
**To Reproduce**
The easiest way is to create a `StructArray` with one column that compresses
well and one that does not:
```rust
use std::{io::Cursor, sync::Arc};

use arrow::{
    array::{Array, ArrayRef, Int32Array, StringArray, StructArray},
    datatypes::{DataType, Field, Fields},
    ipc::{
        writer::{FileWriter, IpcWriteOptions},
        CompressionType,
    },
    record_batch::RecordBatch,
};

fn main() {
    // Set up a StructArray with two columns
    let primes = Arc::new(Int32Array::from(vec![
        31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,
    ]));
    let strings = Arc::new(StringArray::from(vec![
        "Hello world one",
        "Hello world two",
        "Hello world three",
        "Hello world four",
        "Hello world five",
        "Hello world six",
        "Hello world seven",
        "Hello world eight",
        "Hello world nine",
        "Hello world ten",
        "Hello world eleven",
        "Hello world twelve",
    ]));
    let table = StructArray::new(
        Fields::from(vec![
            Field::new("primes", DataType::Int32, false),
            Field::new("strings", DataType::Utf8, false),
        ]),
        vec![primes as ArrayRef, strings as ArrayRef],
        None,
    );

    // Print the size of the original data
    println!("Original size: {:?}", table.get_array_memory_size());

    // Write to an in-memory IPC file with compression
    let record_batch = RecordBatch::from(&table);
    let mut file_writer = FileWriter::try_new_with_options(
        Vec::new(),
        record_batch.schema().as_ref(),
        IpcWriteOptions::default()
            .try_with_compression(Some(CompressionType::LZ4_FRAME))
            .unwrap(),
    )
    .unwrap();
    file_writer.write(&record_batch).unwrap();
    file_writer.finish().unwrap();
    let bytes = file_writer.into_inner().unwrap();

    // Print the size of the compressed data
    println!("Compressed size: {:?}", bytes.len());

    // Read back from the IPC file and decompress
    let mut reader =
        arrow::ipc::reader::FileReader::try_new(Cursor::new(bytes), None).unwrap();
    let record_batch = reader.next().unwrap().unwrap();
    let sr = StructArray::from(record_batch);

    // Print the size of the decompressed data
    println!("Decompressed size: {:?}", sr.get_array_memory_size());
}
```
The output will be something like this:
```bash
Original size: 688
Compressed size: 950
Decompressed size: 1104
```
**Expected behavior**
We expected the decompressed data to use the same amount of memory as the
original data.
An easy way to fix this would be to change the "uncompressed inside
compressed" case in `decompress_to_buffer` to return a real copy of the
uncompressed data slice.
However, this easy fix might cause many memcpys in cases where compression
was requested but much of the data could not be compressed. To make the
behavior a user choice, we could add a "copy_uncompressed" parameter to the
`decompress_to_buffer` method. Since users only have access to the readers,
this should probably be exposed as a reader option that is passed down to the
decompression step...
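The proposed user choice could look roughly like the sketch below. This is a simplified model, not the real arrow-rs `decompress_to_buffer` signature; `Decompressed`, the `copy_uncompressed` flag, and the `Arc<Vec<u8>>` stand-in for `Buffer` are all illustrative assumptions:

```rust
use std::sync::Arc;

/// Hypothetical result of decompressing one block: either a zero-copy
/// view into the raw IPC bytes, or an owned copy (illustrative only).
enum Decompressed {
    /// Pins the whole parent allocation, but avoids a memcpy.
    Shared { parent: Arc<Vec<u8>>, offset: usize, len: usize },
    /// Frees the parent once all views are dropped, at the cost of a copy.
    Owned(Vec<u8>),
}

/// Sketch of the "uncompressed inside compressed" branch with the
/// proposed `copy_uncompressed` user choice.
fn decompress_uncompressed_block(
    raw: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
    copy_uncompressed: bool,
) -> Decompressed {
    if copy_uncompressed {
        // Real copy: retains only `len` bytes.
        Decompressed::Owned(raw[offset..offset + len].to_vec())
    } else {
        // Current zero-copy behavior: retains the whole raw buffer.
        Decompressed::Shared { parent: raw, offset, len }
    }
}

fn main() {
    let raw = Arc::new(vec![7u8; 1000]);
    let retained = |d: &Decompressed| match d {
        Decompressed::Shared { parent, .. } => parent.len(),
        Decompressed::Owned(v) => v.len(),
    };
    let copied = decompress_uncompressed_block(Arc::clone(&raw), 8, 10, true);
    let shared = decompress_uncompressed_block(Arc::clone(&raw), 8, 10, false);
    assert_eq!(retained(&copied), 10);   // only the slice is retained
    assert_eq!(retained(&shared), 1000); // whole raw buffer stays pinned
}
```

A per-reader option would let users who care about memory footprint pay the memcpy cost, while keeping the current zero-copy default for everyone else.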
**Additional context**